Preface


It was our privilege to serve as the program chairs for CAV 2023, the 35th International 
Conference on Computer-Aided Verification. CAV 2023 was held during July 19-22, 
2023 and the pre-conference workshops were held during July 17-18, 2023. CAV 2023
was an in-person event, in Paris, France. 

CAV is an annual conference dedicated to the advancement of the theory and practice 
of computer-aided formal analysis methods for hardware and software systems. The 
primary focus of CAV is to extend the frontiers of verification techniques by expanding 
to new domains such as security, quantum computing, and machine learning. This puts 
CAV at the cutting edge of formal methods research, and this year’s program is a reflection
of this commitment. 

CAV 2023 received a large number of submissions (261). We accepted 15 tool 
papers, 3 case-study papers, and 49 regular papers, which amounts to an acceptance 
rate of roughly 26%. The accepted papers cover a wide spectrum of topics, from theo- 
retical results to applications of formal methods. These papers apply or extend formal 
methods to a wide range of domains such as concurrency, machine learning and neu- 
ral networks, quantum systems, as well as hybrid and stochastic systems. The program 
featured keynote talks by Ruzica Piskac (Yale University), Sumit Gulwani (Microsoft), 
and Caroline Trippel (Stanford University). In addition to the contributed talks, CAV 
also hosted the CAV Award ceremony, and a report from the Synthesis Competition 
(SYNTCOMP) chairs. 

In addition to the main conference, CAV 2023 hosted the following workshops: Meet- 
ing on String Constraints and Applications (MOSCA), Verification Witnesses and Their 
Validation (VeWit), Verification of Probabilistic Programs (VeriProP), Open Problems 
in Learning and Verification of Neural Networks (WOLVERINE), Deep Learning-aided 
Verification (DAV), Hyperproperties: Advances in Theory and Practice (HYPER),
Synthesis (SYNT), Formal Methods for ML-Enabled Autonomous Systems (FoMLAS), and
Verification Mentoring Workshop (VMW). CAV 2023 also hosted a workshop dedicated 
to Thomas A. Henzinger for his 60th birthday.

Organizing a flagship conference like CAV requires a great deal of effort from the 
community. The Program Committee for CAV 2023 consisted of 76 members—a com- 
mittee of this size ensures that each member has to review only a reasonable number of 
papers in the allotted time. In all, the committee members wrote over 730 reviews while 
investing significant effort to maintain and ensure the high quality of the conference pro- 
gram. We are grateful to the CAV 2023 Program Committee for their outstanding efforts 
in evaluating the submissions and making sure that each paper got a fair chance. As in
recent years at CAV, we made artifact evaluation mandatory for tool paper submissions,
but optional for the rest of the accepted papers. This year we received 48 artifact submis- 
sions, out of which 47 submissions received at least one badge. The Artifact Evaluation 
Committee consisted of 119 members who put in significant effort to evaluate each arti- 
fact. The goal of this process was to provide constructive feedback to tool developers and 
help make the research published in CAV more reproducible. We are also very grateful
to the Artifact Evaluation Committee for their hard work and dedication in evaluating 
the submitted artifacts. 

CAV 2023 would not have been possible without the tremendous help we received 
from several individuals, and we would like to thank everyone who helped make CAV 
2023 a success. We would like to thank Alessandro Cimatti, Isil Dillig, Javier Esparza, 
Azadeh Farzan, Joost-Pieter Katoen and Corina Pasareanu for serving as area chairs. 
We also thank Bernhard Kragl and Daniel Dietsch for chairing the Artifact Evaluation
Committee. We also thank Mohamed Faouzi Atig for chairing the workshop organization 
as well as leading publicity efforts, Eric Koskinen as the fellowship chair, Sebastian 
Bardin and Ruzica Piskac as sponsorship chairs, and Srinidhi Nagendra as the website 
chair. Srinidhi, along with Enrique Roman Calvo, helped prepare the proceedings. We 
also thank Ankush Desai, Eric Koskinen, Burcu Kulahcioglu Ozkan, Marijana Lazic, and 
Matteo Sammartino for chairing the mentoring workshop. Last but not least, we would 
like to thank the members of the CAV Steering Committee (Kenneth McMillan, Aarti 
Gupta, Orna Grumberg, and Daniel Kroening) for helping us with several important 
aspects of organizing CAV 2023. 

We hope that you will find the proceedings of CAV 2023 scientifically interesting 
and thought-provoking! 

Abstract. Bitwuzla is a new SMT solver for the quantifier-free and
quantified theories of fixed-size bit-vectors, arrays, floating-point arith- 
metic, and uninterpreted functions. This paper serves as a comprehen- 
sive system description of its architecture and components. We evaluate 
Bitwuzla’s performance on all benchmarks of supported logics in SMT- 
LIB and provide a comparison against other state-of-the-art SMT solvers. 


1 Introduction 


Satisfiability Modulo Theories (SMT) solvers serve as back-end reasoning engines 
for a wide range of applications in formal methods (e.g., [13,14,21,23,35]). In 
particular, the theory of fixed-size bit-vectors, in combination with arrays,
uninterpreted functions and floating-point arithmetic, has received increasing
interest in recent years, as witnessed by the high and increasing number of
benchmarks submitted to the SMT-LIB benchmark library [5] and the number of
participants in corresponding divisions in the annual SMT competition (SMT- 
COMP) [42]. State-of-the-art SMT solvers supporting (a subset of) these the- 
ories include Boolector [31], cvc5 [3], MathSAT [15], STP [19], Yices2 [17] and 
Z3 [25]. Among these, Boolector had been largely dominating the quantifier-free 
divisions with bit-vectors and arrays in SMT-COMP over the years [2]. 
Boolector was originally published in 2009 by Brummayer and Biere [11] as 
an SMT solver for the quantifier-free theories of fixed-size bit-vectors and arrays. 
Since 2012, Boolector has been mainly developed and maintained by the authors 
of this paper, who have extended it with support for uninterpreted functions and 
lazy handling of non-recursive lambda terms [32,38,39], local search strategies 
for quantifier-free bit-vectors [33,34], and quantified bit-vector formulas [40]. 
While Boolector is still competitive in terms of performance, it has several 
limitations. Its code base consists of largely monolithic C code, with a rigid 
architecture focused on a very specialized, tight integration of bit-vectors and 
arrays. Consequently, it is cumbersome to maintain, and adding new features 
is difficult and time intensive. Further, Boolector requires manual management 
of memory and reference counts from API users; terms and sorts are tied to a 
specific solver instance and cannot be shared across instances; all preprocessing 
techniques are destructive, which disallows incremental preprocessing; and due to
architectural limitations, incremental solving with quantifiers is not supported.

This work was supported in part by the Stanford Center for Automated Reasoning,

In 2018, we forked Boolector in preparation for addressing these issues, and 
entered an improved and extended version of this fork as Bitwuzla in the SMT 
competition 2020 [26]. At that time, Bitwuzla extended Boolector with: sup- 
port for floating-point arithmetic by integrating SymFPU [8] (a C++ library of 
bit-vector encodings of floating-point operations); a novel generalization of its 
propagation-based local search strategy [33] to ternary values [27]; unsat core 
extraction; and since 2022, support for reasoning about quantified formulas for all 
supported theories and their combinations. This version of Bitwuzla was already 
made available on GitHub at [28], but not officially released. However, archi- 
tectural and structural limitations inherited from Boolector remained. Thus, to 
overcome these limitations and address the above issues, we decided to discard 
the existing code base and rewrite Bitwuzla from scratch. 

In this paper, we present the first official release of Bitwuzla, an SMT solver 
for the (quantified and quantifier-free) theories of fixed-size bit-vectors, arrays, 
floating-point arithmetic, uninterpreted functions and their combinations. Its 
name (pronounced as bitvootslah) is derived from an Austrian dialect expression
that can be translated as "someone who tinkers with bits". Bitwuzla is written
in C++, inspired by techniques implemented in Boolector. That is, rather than 
only redesigning problematic aspects of Boolector, we carefully dissected and 
(re)evaluated its parts to serve as guidance when writing a new solver from 
scratch. In that sense, it is not a reimplementation of Boolector, but can be 
considered its superior successor. Bitwuzla is available on GitHub [28] under the 
MIT license, and its documentation is available at [29]. 


2 Architecture 


Bitwuzla supports reasoning about quantifier-free and quantified formulas over 
fixed-size bit-vectors, floating-point arithmetic, arrays and uninterpreted func- 
tions as standardized in SMT-LIB [4]. In this section, we provide an overview of 
Bitwuzla’s system architecture and its core components as given in Fig. 1. 
Bitwuzla consists of two main components: the Solving Context and 
the Node Manager. The Solving Context can be seen as a solver instance 
that determines satisfiability of a set of formulas and implements the lazy, 
abstraction/refinement-based SMT paradigm lemmas on demand [6,24] (in con- 
trast to SMT solvers like cvc5 and Z3, which are based on the CDCL(T) [36] 
framework). The Node Manager is responsible for constructing and maintaining 
nodes and types and is shared across multiple Solving Context instances. 
Bitwuzla provides a comprehensive C++ API as its main interface, with a C 
and Python API built on top. All features of the C++ API are also accessible 
to C and Python users. The API documentation is available at [29]. The C++ 
API exports Term, Sort, Bitwuzla, and Option classes for constructing nodes 
and types, configuring solver options and constructing Bitwuzla solver instances 
(the external representation of Solving Contexts). Term and Sort objects may be
used in multiple Bitwuzla instances. The parser interacts with the solver instance
via the C++ API. A textual command line interface (CLI) builds on top of the
parser, supporting SMT-LIBv2 [4] and BTOR2 [35] as input languages.

Fig. 1. Bitwuzla system architecture.


2.1 Node Manager 


Bitwuzla represents formulas and terms as reference-counted, immutable nodes 
in a directed acyclic graph. The Node Manager is responsible for constructing 
and managing these nodes and employs hash-consing to maximize sharing of 
subgraphs. Automatic reference counting allows the Node Manager to determine 
when to delete nodes. Similarly, types are constructed and managed by the Type 
Manager, which is maintained by the Node Manager. Nodes and types are stored 
globally (thread-local) in the Node Database and Type Database, which has the 
key advantage that they can be shared between arbitrarily many solving contexts 
within one thread. This is one of the key differences to Boolector’s architecture, 
where terms and types are manually reference counted and tied to a single solver 
instance, which does not allow sharing between solver instances. 
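
Hash-consing as described above can be illustrated with a minimal Python sketch. The names `Node`, `NodeManager` and `mk_node` are hypothetical and do not correspond to Bitwuzla's C++ classes; Python's garbage collector stands in for reference counting, and the sketch ignores types entirely.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Node:
    """Immutable DAG node: an operator kind plus its children."""
    kind: str
    children: Tuple["Node", ...] = ()
    symbol: str = ""


class NodeManager:
    """Hypothetical manager that returns one shared object per distinct term."""

    def __init__(self) -> None:
        self._unique: Dict[tuple, Node] = {}

    def mk_node(self, kind: str, *children: Node, symbol: str = "") -> Node:
        # Children are already unique, so their identities form a valid key.
        key = (kind, tuple(id(c) for c in children), symbol)
        node = self._unique.get(key)
        if node is None:
            node = Node(kind, tuple(children), symbol)
            self._unique[key] = node
        return node

    def size(self) -> int:
        return len(self._unique)


nm = NodeManager()
x = nm.mk_node("const", symbol="x")
y = nm.mk_node("const", symbol="y")
s1 = nm.mk_node("bvadd", x, y)
s2 = nm.mk_node("bvadd", x, y)  # structurally identical to s1
t = nm.mk_node("bvmul", s1, s2)
```

Because both requests for `x + y` return the same object, the product `t` refers to a single shared subgraph, and any number of solver instances could hold references to it.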


2.2 Solving Context 


A Solving Context is the internal equivalent of a solver instance and deter- 
mines the satisfiability of a set of asserted formulas (assertions). Solving Con- 
texts are fully configurable via options and provide an incremental interface for 
adding and removing assertions via push and pop. Incremental solving allows 
users to perform multiple satisfiability checks with similar sets of assertions 
while reusing work from earlier checks. On the API level, Bitwuzla also
supports satisfiability queries under a given set of assumptions (SMT-LIB command
check-sat-assuming), which are internally handled via push and pop. 
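
The following Python sketch illustrates how a check under assumptions can be reduced to push and pop on an assertion stack. It is a toy under stated assumptions: the class `SolvingContext` and its methods are hypothetical names, formulas are Python predicates over Boolean models, and satisfiability is decided by brute-force enumeration, so no work is reused between checks.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence

Formula = Callable[[Dict[str, bool]], bool]


class SolvingContext:
    """Hypothetical, brute-force stand-in for an incremental solver instance."""

    def __init__(self, variables: Sequence[str]) -> None:
        self.variables = list(variables)
        self.levels: List[List[Formula]] = [[]]  # level 0 is always present

    def assert_formula(self, f: Formula) -> None:
        self.levels[-1].append(f)

    def push(self) -> None:
        self.levels.append([])

    def pop(self) -> None:
        if len(self.levels) == 1:
            raise RuntimeError("cannot pop assertion level 0")
        self.levels.pop()

    def check_sat(self) -> bool:
        assertions = [f for level in self.levels for f in level]
        for values in product([False, True], repeat=len(self.variables)):
            model = dict(zip(self.variables, values))
            if all(f(model) for f in assertions):
                return True
        return False

    def check_sat_assuming(self, assumptions: Sequence[Formula]) -> bool:
        # Assumptions live on a temporary level that is discarded afterwards.
        self.push()
        try:
            for f in assumptions:
                self.assert_formula(f)
            return self.check_sat()
        finally:
            self.pop()


ctx = SolvingContext(["a", "b"])
ctx.assert_formula(lambda m: m["a"] or m["b"])
sat_plain = ctx.check_sat()
sat_assuming = ctx.check_sat_assuming(
    [lambda m: not m["a"], lambda m: not m["b"]]
)
sat_after = ctx.check_sat()
```

The assumptions make the query unsatisfiable, but they leave no trace once the temporary level is popped, so the final check succeeds again.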

Nodes and types constructed via the Node Manager may be shared between 
multiple Solving Contexts. If the set of assertions is satisfiable, the Solving
Context provides a model for the input formula. It further allows querying model
values for any term, based on this model (SMT-LIB command get-value). In
case of unsatisfiable queries, the Solving Context can be configured to extract 
an unsatisfiable core and unsat assumptions. 

A Solving Context consists of three main components: a Rewriter, a Prepro- 
cessor and a Solver Engine. The Rewriter and Preprocessor perform local (node 
level) and global (over all assertions) simplifications, whereas the Solver Engine 
is the central solving engine, managing theory solvers and their interaction. 


Preprocessor. As a first step of each satisfiability check, prior to solving, the 
preprocessor applies a pipeline of preprocessing passes in a predefined order 
to the current set of assertions until fixed-point. Each preprocessing pass imple- 
ments a set of satisfiability-preserving transformations. All passes can be option- 
ally disabled except for one mandatory transformation, the reduction of the full 
set of operators supported on the API level to a reduced operator set: Boolean 
connectives are expressed by means of {¬, ∧}, quantifier ∃ is represented in terms
of ∀, inequalities are represented in terms of < and >, signed bit-vector
operators are expressed in terms of unsigned operators, and more. These reduction
transformations are a subset of the term rewrites performed by the Rewriter,
and rewriting is implemented as one preprocessing pass. Additionally, Bitwuzla
implements 7 preprocessing passes, which are applied sequentially, after
rewriting, until no further transformations are possible: and flattening, which splits
a top-level ∧ into its subformulas, e.g., a ∧ (b ∧ (c = d)) into {a, b, c = d};
substitution, which replaces all occurrences of a constant x with a term t if x = t
is derived on the top level; skeleton preprocessing, which simplifies the Boolean 
skeleton of the input formula with a SAT solver; embedded constraints, which 
substitutes all occurrences of top-level constraints in subterms of other top-level 
constraints with true; extract elimination, which eliminates bit-vector extracts 
over constants; lambda elimination, which applies beta reduction on lambda 
terms; and normalization of arithmetic expressions. 
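
Two of the passes listed above, and flattening and substitution, can be sketched in Python as follows. This is a simplified illustration rather than Bitwuzla's implementation: terms are nested tuples of the form (operator, arguments...), constants are strings, all function names are hypothetical, and the sketch keeps the defining equality x = t among the assertions.

```python
from typing import Dict, List, Tuple, Union

Term = Union[str, Tuple]  # a constant name or (operator, arg, ...)


def flatten_and(assertions: List[Term]) -> List[Term]:
    """Split every top-level conjunction into its conjuncts."""
    out: List[Term] = []
    stack = list(reversed(assertions))
    while stack:
        t = stack.pop()
        if isinstance(t, tuple) and t[0] == "and":
            stack.extend(reversed(t[1:]))
        elif t not in out:
            out.append(t)
    return out


def substitute(t: Term, subst: Dict[str, Term]) -> Term:
    if isinstance(t, str):
        return subst.get(t, t)
    return (t[0],) + tuple(substitute(a, subst) for a in t[1:])


def occurs(x: str, t: Term) -> bool:
    if isinstance(t, str):
        return t == x
    return any(occurs(x, a) for a in t[1:])


def substitution_pass(assertions: List[Term]) -> List[Term]:
    """Use a top-level equality x = t (x not in t) to replace x elsewhere."""
    for eq in assertions:
        if isinstance(eq, tuple) and eq[0] == "=" and isinstance(eq[1], str):
            x, t = eq[1], eq[2]
            if occurs(x, t):
                continue
            others = [a for a in assertions if a is not eq]
            rest = [substitute(a, {x: t}) for a in others]
            if rest != others:
                return [eq] + rest
    return assertions


def preprocess(assertions: List[Term]) -> List[Term]:
    """Apply both passes in a fixed order until nothing changes."""
    while True:
        new = substitution_pass(flatten_and(assertions))
        if new == assertions:
            return new
        assertions = new


example = [("and", "a", ("and", ("=", "x", ("bvadd", "y", "1")),
                         ("bvult", "x", "z")))]
result = preprocess(example)
```

On the example, the nested conjunction is split into three assertions, and the occurrence of x in the inequality is replaced by y + 1.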

Preprocessing in Bitwuzla is fully incremental: all passes are applied to the 
current set of assertions, from all assertion levels, and simplifications derived 
from lower levels are applied to all assertions of higher levels (including
assumptions). Assertions are processed per assertion level i, starting from i = 0, and for
each level i > 0, simplifications are applied based on information from all levels
j < i. Note that when solving under assumptions, Bitwuzla internally pushes an
assertion level and handles these assumptions as assertions of that level. When
a level i is popped, the assertions of that level are popped, and the state of the
preprocessor is backtracked to the state that was associated with level i - 1. Note
that preprocessing assertion levels i < j with information derived from level j
requires not only restoring the state of the preprocessor, but also reconstructing
the assertions on levels i < j when level j is popped to the state before level j
was pushed; this is left to future work.

Boolector, on the other hand, only performs preprocessing based on top- 
level assertions (assertion level 0) and does not incorporate any information 
from assumptions or higher assertion levels. 


Rewriter. The rewriter transforms terms via a predefined set of rewrite rules 
into semantically equivalent normal forms. This transformation is local in the 
sense that it is independent from the current set of assertions. We distinguish 
between required and optional rewrite rules, and further group rules into so- 
called rewrite levels from 0-2. The set of required rules consists of operator 
elimination rewrites, which are considered level 0 rewrites and ensure that nodes 
only contain operators from a reduced base set. For example, the two’s
complement -x of a bit-vector term x is rewritten to (~x + 1) by means of one’s
complement and bit-vector addition. Optional rewrite rules are grouped into 
level 1 and level 2. Level 1 rules perform rewrites that only consider the imme- 
diate children of a node, whereas level 2 rules may consider multiple levels of 
children. If not implemented carefully, level 2 rewrites can potentially destroy 
sharing of subterms and consequently increase the overall size of the formula. 
For example, rewriting (t + 0) to t is considered a level 1 rewrite rule, whereas 
rewriting (a - b = c) to (b + c = a) is considered a level 2 rule since it may
introduce an additional bit-vector addition (b + c) if (a - b) occurs somewhere
else in the formula. The maximum rewrite level of the rewriter can be configured 
by the user. 

Rewriting is applied on the current set of assertions as a preprocessing pass 
and, as all other passes, applied until fixed-point. That is, on any given term, 
the rewriter applies rewrite rules until no further rewrite rules can be applied. 
For this, the rewriter must guarantee that no set of applied rewrite rules may 
lead to cyclic rewriting of terms. Additionally, all components of the solving 
context apply rewriting on freshly created nodes to ensure that all nodes are 
always fully normalized. In order to avoid processing nodes more than once, the 
rewriter maintains a cache that maps nodes to their fully rewritten form. 
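
A minimal Python sketch of such a cached, fixed-point rewriter is given below. It is an illustration under assumptions: the class `Rewriter`, the tuple-based term encoding and the operator names are hypothetical, and only two of the rules mentioned above are implemented (the level 0 elimination of negation and the level 1 rule t + 0 = t).

```python
from typing import Dict, Tuple, Union

Term = Union[str, int, Tuple]  # constant name, literal value, or (op, args...)


class Rewriter:
    """Hypothetical bottom-up rewriter with a cache of fully rewritten terms."""

    def __init__(self, max_level: int = 1) -> None:
        self.max_level = max_level
        self.cache: Dict[Term, Term] = {}

    def rewrite(self, t: Term) -> Term:
        if not isinstance(t, tuple):
            return t
        if t in self.cache:
            return self.cache[t]
        cur = t
        while True:
            # Rewrite children first, then try the rules on the node itself.
            cur = (cur[0],) + tuple(self.rewrite(c) for c in cur[1:])
            nxt = self._apply_rules(cur)
            if nxt == cur:
                break
            cur = nxt
            if not isinstance(cur, tuple):
                break
        self.cache[t] = cur
        return cur

    def _apply_rules(self, t: Tuple) -> Term:
        op, args = t[0], t[1:]
        # Level 0 (required): eliminate negation, -x == ~x + 1.
        if op == "bvneg":
            return ("bvadd", ("bvnot", args[0]), 1)
        # Level 1 (optional): t + 0 == t, inspecting only direct children.
        if self.max_level >= 1 and op == "bvadd":
            if args[1] == 0:
                return args[0]
            if args[0] == 0:
                return args[1]
        return t


rw = Rewriter(max_level=1)
result = rw.rewrite(("bvneg", ("bvadd", "x", 0)))
```

The inner addition is first simplified to x, after which the negation is eliminated, yielding (~x + 1). With `max_level=0`, only the required elimination rule would fire.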


Solver Engine. After preprocessing, the solving context sends the current set 
of assertions to the Solver Engine, which implements a lazy SMT paradigm 
called lemmas on demand [6,24]. However, rather than using a propositional 
abstraction of the input formula as in [6,24], it implements a bit-vector abstrac- 
tion similar to Boolector [12,38]. At its core, the Solver Engine maintains a 
bit-vector theory solver and a solver for each supported theory. Quantifier rea- 
soning is handled by a dedicated quantifiers module, implemented as a theory 
solver. The Solver Engine manages all theory solvers, the distribution of relevant 
terms, and the processing of lemmas generated by the theory solvers. 

The bit-vector solver is responsible for reasoning about the bit-vector abstrac- 
tion of the input assertions and lemmas generated during solving, which includes 
all propositional and bit-vector terms. Theory atoms that do not belong to 
the bit-vector theory are abstracted as Boolean constants, and bit-vector terms 
whose operator does not belong to the bit-vector theory are abstracted as
bit-vector constants. For example, an array select operation of type bit-vector is
abstracted as a bit-vector constant, while an equality between two arrays is 
abstracted as a Boolean constant. 

If the bit-vector abstraction is satisfiable, the bit-vector solver produces a sat- 
isfying assignment, and the floating-point, array, function and quantifier solvers 
check this assignment for theory consistency. If a solver finds a theory inconsis- 
tency, i.e., a conflict between the current satisfying assignment and the solver’s 
theory axioms, it produces a lemma to refine the bit-vector abstraction and rule 
out the detected inconsistency. Theory solvers are allowed to send any number 
of lemmas, with the only requirement that if a theory solver does not send a 
lemma, the current satisfying assignment is consistent with the theory. 

Finding a satisfying assignment for the bit-vector abstraction and the subse- 
quent theory consistency checks are implemented as an abstraction/refinement 
loop as given in Algorithm 1. Whenever a theory solver sends lemmas, the loop 
is restarted to get a new satisfying assignment for the refined bit-vector abstrac- 
tion. The loop terminates if the bit-vector abstraction is unsatisfiable, or if the 
bit-vector abstraction is satisfiable and none of the theory solvers report any the- 
ory inconsistencies. Note that the abstraction/refinement algorithm may return 
unknown if the input assertions include quantified formulas. 


Algorithm 1. Abstraction/refinement loop in Solver Engine. Function SOLVE(A)
is called on the current set of preprocessed assertions A, which is iteratively
refined with a set of lemmas L.

function SOLVE(A)
    r ← Unknown, L ← ∅
    repeat
        A ← A ∪ L
        r, M ← T_BV::SOLVE(A)            ▷ Solve bit-vector abstraction of A
        if r = Unsat then break end if
        L ← T_FP::CHECK(M)               ▷ Check FP theory consistency of M
        if L ≠ ∅ then continue end if
        L ← T_A::CHECK(M)                ▷ Check array theory consistency of M
        if L ≠ ∅ then continue end if
        L ← T_UF::CHECK(M)               ▷ Check UF theory consistency of M
        if L ≠ ∅ then continue end if
        L ← T_Q::CHECK(M)                ▷ Check quantified formulas in M
    until L = ∅
    return r
end function
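
The structure of this loop can be mirrored in a small executable Python sketch. Everything in it is a toy assumption made for illustration: the "bit-vector solver" enumerates all values of a few fixed-width constants, and a single made-up theory checker treats the constant sq as an abstraction of x * x and emits value-based lemmas when a candidate model disagrees.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Model = Dict[str, int]
Constraint = Callable[[Model], bool]
Checker = Callable[[Model], List[Constraint]]


def solve_abstraction(variables: Sequence[str], width: int,
                      constraints: Sequence[Constraint]) -> Optional[Model]:
    """Brute-force stand-in for the bit-vector solver."""
    for values in product(range(1 << width), repeat=len(variables)):
        model = dict(zip(variables, values))
        if all(c(model) for c in constraints):
            return model
    return None


def solve(variables: Sequence[str], width: int,
          assertions: Sequence[Constraint],
          checkers: Sequence[Checker]) -> Tuple[str, Optional[Model]]:
    constraints = list(assertions)
    while True:
        model = solve_abstraction(variables, width, constraints)
        if model is None:
            return "unsat", None
        for check in checkers:
            lemmas = check(model)
            if lemmas:
                constraints.extend(lemmas)  # refine, then restart the loop
                break
        else:
            return "sat", model  # no checker reported an inconsistency


def square_checker(width: int) -> Checker:
    """Toy theory: the constant 'sq' abstracts x * x modulo 2^width."""
    def check(model: Model) -> List[Constraint]:
        xv = model["x"]
        expected = (xv * xv) % (1 << width)
        if model["sq"] == expected:
            return []
        return [lambda m: m["x"] != xv or m["sq"] == expected]
    return check


sat_result = solve(["x", "sq"], 3,
                   [lambda m: m["sq"] == 4, lambda m: m["x"] > 0],
                   [square_checker(3)])
unsat_result = solve(["x", "sq"], 3,
                     [lambda m: m["sq"] == 5],
                     [square_checker(3)])
```

In the first query, the candidate x = 1, sq = 4 is ruled out by a lemma before x = 2 is found. In the second, one lemma per value of x is needed before the abstraction becomes unsatisfiable.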


Backtrackable Data Structures. Every component of the Solving Context
except for the Rewriter depends on the current set of assertions. When solving
incrementally, the assertion stack is modified by adding (SMT-LIB command
push) and removing (SMT-LIB command pop) assertions. In contrast to Boolec- 
tor, Bitwuzla supports saving and restoring the internal solver state, i.e., the 
state of the Solving Context, corresponding to these push and pop operations 
by means of backtrackable data structures. These data structures are custom vari- 
ants of mutable data structures provided in the C++ standard library, extended 
with an interface to save and restore their state on push and pop calls. This 
allows the solver to take full advantage of incremental solving by reusing work 
from previous satisfiability checks and backtracking to previous states. Further, 
this enables incremental preprocessing. Bitwuzla’s backtrackable data structures are
conceptually similar to context-dependent data structures in cvc5 [3]. 
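
As a rough illustration of the idea, the following Python sketch shows a list that records its size on push and truncates itself on pop. The class name `BacktrackableList` and its interface are hypothetical and only approximate what a C++ counterpart might look like.

```python
from typing import Generic, Iterator, List, TypeVar

T = TypeVar("T")


class BacktrackableList(Generic[T]):
    """Hypothetical list that restores its earlier contents on pop()."""

    def __init__(self) -> None:
        self._items: List[T] = []
        self._control: List[int] = []  # size of _items at each push()

    def append(self, item: T) -> None:
        self._items.append(item)

    def push(self) -> None:
        self._control.append(len(self._items))

    def pop(self) -> None:
        if not self._control:
            raise RuntimeError("pop without matching push")
        # Drop everything that was added since the matching push().
        del self._items[self._control.pop():]

    def __len__(self) -> int:
        return len(self._items)

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)


facts: BacktrackableList[str] = BacktrackableList()
facts.append("a")
facts.push()
facts.append("b")
facts.append("c")
inside = list(facts)
facts.pop()
after = list(facts)
```

State added at a higher level disappears on pop, while everything derived at lower levels stays available for later checks.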


3 Theory Solvers 


The Solver Engine maintains a theory solver for each supported theory and 
implements a module for handling quantified formulas as a dedicated theory 
solver. The central engine of the Solver Engine is the bit-vector theory solver, 
which reasons about a bit-vector abstraction of the current set of input asser- 
tions, refined with lemmas generated by other theory solvers. The theories of 
fixed-size bit-vectors, arrays, floating-point arithmetic, and uninterpreted func- 
tions are combined via a model-based theory combination approach similar 
to [12,38]. 

Theory combination is based on candidate models produced by the bit-vector 
theory solver for the bit-vector abstraction (function T_BV::solve() in Algorithm
1). For each candidate model, each theory solver checks consistency with the
axioms of the corresponding theory (functions T_*::check() in Algorithm 1). If a
theory solver requests a model value for a term that is not part of the current
bit-vector abstraction, the theory solver that “owns” that term is queried for a
value. If this value or the candidate model is inconsistent with the axioms of the 
theory querying the value, it sends a lemma to refine the bit-vector abstraction. 


3.1 Arrays 


The array theory solver implements and extends the array procedure from [12] 
with support for reasoning over (equalities of) nested arrays and non-extensional 
constant arrays. This is in contrast to Boolector, which generalizes the lemmas 
on demand procedure for extensional arrays as described in [12] to non-recursive 
first-order lambda terms [37,38], without support for nested arrays. Generalizing 
arrays to lambda terms allows using the same procedure for arrays and
uninterpreted functions and enables a natural, compact representation and extraction
of extended array operations such as memset, memcpy and array initialization
patterns as described in [39]. As an example, memset(a, i, n, e), which updates
n elements of array a within range [i, i + n) to a value e starting from index i,
can be represented as λj. ite(i ≤ j < i + n, e, a[j]). Reasoning over equalities
involving arbitrary lambda terms (including these operations), however, requires
higher-order reasoning, which is not supported by Boolector. Further,
extensionality over standard array operators that are represented as lambda terms (e.g.,
store) requires special handling, which makes the procedure unnecessarily com- 
plex. Bitwuzla, on the other hand, implements separate theory solvers for arrays 
and uninterpreted functions. Consequently, since it does not generalize arrays 
to lambda terms, it cannot utilize the elegant representation of Boolector for 
the extended array operations of [39]. Thus, currently, extracting and reason- 
ing about these operations is not yet supported. Instead of representing such 
operators as lambda terms, we plan to introduce specific array operators. This 
will allow a seamless integration into Bitwuzla’s array procedure, with support 
for reasoning about extensionality involving these operators. We will also add 
support for reasoning about extensional constant arrays in the near future. 
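
The lambda representation of memset discussed above can be made concrete with a few lines of Python, where an array is modeled as a function from indices to values. This is only an illustration of the term λj. ite(i ≤ j < i + n, e, a[j]): the function name is hypothetical, indices are unbounded Python integers, and bit-vector wrap-around is ignored.

```python
from typing import Callable

Array = Callable[[int], int]


def memset(a: Array, i: int, n: int, e: int) -> Array:
    """Return the array  lambda j. ite(i <= j < i + n, e, a[j])."""
    return lambda j: e if i <= j < i + n else a(j)


zeros: Array = lambda j: 0
b = memset(zeros, 2, 3, 7)
contents = [b(j) for j in range(6)]
```

Reading `b` at indices 0 to 5 gives 0, 0, 7, 7, 7, 0: the three elements starting at index 2 are overwritten and all other reads fall through to the original array.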


3.2 Bit-Vectors 


The bit-vector theory solver implements two orthogonal approaches: the classic 
bit-blasting technique employed by most state-of-the-art bit-vector solvers, which 
eagerly translates the current bit-vector abstraction to SAT; and the ternary 
propagation-based local search approach presented in [27]. Since local search
procedures can only determine satisfiability, they are particularly effective as
a complementary strategy, in combination with (rather than instead of)
bit-blasting [27,33]. Bitwuzla’s bit-vector solver allows combining local search with
bit-blasting in a sequential portfolio setting: the local search procedure is run 
until a predefined resource limit is reached before falling back on the bit-blasting 
procedure. Currently, Bitwuzla allows combining these two approaches only in 
this particular setting. We plan to explore more interleaved configurations,
possibly while sharing information between the procedures, in future work.


Bit-Blasting. Bitwuzla implements the eager reduction of the bit-vector abstrac- 
tion to propositional logic in two phases. First, it constructs an And-Inverter- 
Graph (AIG) circuit representation of the abstraction while applying AIG-level 
rewriting techniques [10]. This AIG circuit is then converted into Conjunctive 
Normal Form (CNF) via Tseitin transformation and sent to the SAT solver 
back-end. Note that for assertions from levels > 0, Bitwuzla leverages solving 
under assumptions in the SAT solver in order to be able to backtrack to lower 
assertion levels on pop. Bitwuzla supports CaDiCaL [7], CryptoMiniSat [41], 
and Kissat [7] as SAT back-ends and uses CaDiCaL as its default SAT solver. 
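
The second phase, the Tseitin transformation of an AIG into CNF, can be sketched in Python as follows. The sketch makes several simplifying assumptions: the class `Aig` is hypothetical, literals are signed integers as in the DIMACS format, no AIG-level rewriting or structural hashing is performed, and a brute-force enumeration replaces the SAT solver.

```python
from itertools import product
from typing import Dict, List, Sequence, Set, Tuple

Clause = List[int]


class Aig:
    """Hypothetical And-Inverter Graph; a negative literal means inversion."""

    def __init__(self) -> None:
        self.num_vars = 0
        self.gates: Dict[int, Tuple[int, int]] = {}

    def new_input(self) -> int:
        self.num_vars += 1
        return self.num_vars

    def mk_and(self, a: int, b: int) -> int:
        self.num_vars += 1
        self.gates[self.num_vars] = (a, b)
        return self.num_vars

    def mk_or(self, a: int, b: int) -> int:
        return -self.mk_and(-a, -b)

    def mk_xor(self, a: int, b: int) -> int:
        return self.mk_or(self.mk_and(a, -b), self.mk_and(-a, b))


def tseitin(aig: Aig, root: int) -> List[Clause]:
    """Encode g <-> (a AND b) for every gate and assert the root literal."""
    clauses: List[Clause] = [[root]]
    for g, (a, b) in aig.gates.items():
        clauses += [[-g, a], [-g, b], [g, -a, -b]]
    return clauses


def satisfying_inputs(clauses: Sequence[Clause], num_vars: int,
                      inputs: Sequence[int]) -> Set[Tuple[bool, ...]]:
    """Enumerate all assignments and project the models onto the inputs."""
    found: Set[Tuple[bool, ...]] = set()
    for bits in product([False, True], repeat=num_vars):
        value = lambda lit: bits[abs(lit) - 1] ^ (lit < 0)
        if all(any(value(l) for l in c) for c in clauses):
            found.add(tuple(bits[i - 1] for i in inputs))
    return found


aig = Aig()
x, y = aig.new_input(), aig.new_input()
root = aig.mk_xor(x, y)
cnf = tseitin(aig, root)
models = satisfying_inputs(cnf, aig.num_vars, [x, y])
```

Each AND gate contributes three clauses, so the XOR circuit with three gates yields ten clauses including the unit clause for the root, and the CNF is satisfied exactly when x and y differ.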


Local Search. Bitwuzla implements an improved version of the ternary propa- 
gation-based local search procedure described in [27]. This procedure is a gener- 
alization of the propagation-based local search approach implemented in Boolec- 
tor [33] and addresses one of its main weaknesses: its obliviousness to bits that 
can be simplified to constant values. Propagation-based local search is based 
on propagating target values from the outputs to the inputs, does not require 
bit-blasting, brute-force randomization or restarts, and lifts the concept of back- 
tracing of Automatic Test Pattern Generation (ATPG) [22] to the word-level. 
Boolector additionally implements the stochastic local search (SLS) approach
presented in [18], optionally augmented with a propagation-based strategy [34]. 
Bitwuzla, however, only implements our ternary propagation-based approach 
since it was shown to significantly outperform these approaches [33]. 


3.3 Floating-Point Arithmetic 


The solver for the theory of floating-point arithmetic implements an eager 
translation of floating-point atoms in the bit-vector abstraction to equisatis- 
fiable formulas in the theory of bit-vectors, a process sometimes referred to as 
word-blasting. To translate floating-point expressions to the word-level, Bitwuzla 
integrates SymFPU [9], a C++ library of bit-vector encodings of floating-point 
operations. SymFPU uses templated types for Booleans, (un)signed bit-vectors,
rounding modes and floating-point formats, which allows utilizing solver-specific 
representations. SymFPU has also been integrated into cvc5 [3]. 
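
To give an intuition for the word-level view, the Python sketch below expresses two floating-point predicates purely in terms of bit-vector operations (extraction, comparison) on the IEEE 754 single-precision bit pattern. It does not reproduce SymFPU's encodings; the function names are hypothetical and the standard `struct` module is used only to obtain bit patterns of concrete values.

```python
import struct


def float32_bits(x: float) -> int:
    """Bit pattern of x as an IEEE 754 single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp_is_nan(bits: int) -> bool:
    exponent = (bits >> 23) & 0xFF      # extract bits 30..23
    significand = bits & 0x7FFFFF       # extract bits 22..0
    return exponent == 0xFF and significand != 0


def fp_is_zero(bits: int) -> bool:
    # Everything except the sign bit must be zero (covers +0 and -0).
    return (bits & 0x7FFFFFFF) == 0


one_bits = float32_bits(1.0)
nan_check = fp_is_nan(float32_bits(float("nan")))
inf_check = fp_is_nan(float32_bits(float("inf")))
neg_zero_check = fp_is_zero(float32_bits(-0.0))
```

An atom such as fp.isNaN(x) thus becomes a Boolean combination of bit-vector comparisons over the 32-bit representation of x, which the bit-vector solver can handle directly.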


3.4 Uninterpreted Functions 


For the theory of uninterpreted functions (UF), Bitwuzla implements dynamic 
Ackermannization [16], which is a lazy form of Ackermann’s reduction. The 
UF solver checks whether the current satisfying assignment of the bit-vector 
abstraction is consistent with the function congruence axiom a = b → f(a) =
f(b) and produces a lemma whenever the axiom is violated.
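
The consistency check can be sketched in Python as follows, assuming a single unary function f whose applications have been abstracted as fresh constants. The function `check_congruence` and the tuple encoding of lemmas are hypothetical; a lemma (a, b, fa, fb) stands for the instance a = b → fa = fb of the congruence axiom.

```python
from typing import Dict, List, Tuple

App = Tuple[str, str]            # (constant abstracting f(arg), arg)
Lemma = Tuple[str, str, str, str]


def check_congruence(model: Dict[str, int], apps: List[App]) -> List[Lemma]:
    """Return one lemma per pair of applications that violates congruence."""
    lemmas: List[Lemma] = []
    for idx, (fa, a) in enumerate(apps):
        for fb, b in apps[idx + 1:]:
            if model[a] == model[b] and model[fa] != model[fb]:
                lemmas.append((a, b, fa, fb))
    return lemmas


apps = [("f_a", "a"), ("f_b", "b"), ("f_c", "c")]
bad_model = {"a": 1, "b": 1, "c": 2, "f_a": 0, "f_b": 3, "f_c": 3}
good_model = {"a": 1, "b": 1, "c": 2, "f_a": 3, "f_b": 3, "f_c": 0}
lemmas = check_congruence(bad_model, apps)
no_lemmas = check_congruence(good_model, apps)
```

Only the pair whose arguments coincide in the model but whose results differ gives rise to a lemma, which is what makes the reduction lazy compared to adding all pairwise instances up front.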


3.5 Quantifiers 


Quantified formulas are handled by the quantifiers module, which is treated as 
a theory solver and implements model-based quantifier instantiation [20] for all 
supported theories and their combinations. In the bit-vector abstraction, quan- 
tified formulas are abstracted as Boolean constants. Based on the assignment of 
these constants, the quantifiers solver produces instantiation or Skolemization 
lemmas. If the constant is assigned to true, the quantifier is treated as a
universal quantifier and the solver produces instantiation lemmas. If the constant is
assigned to false, the solver generates a Skolemization lemma. Bitwuzla allows
combining quantifiers with all supported theories as well as incremental solving
and unsat core extraction. This is in contrast to Boolector, which only supports 
sequential reasoning about quantified bit-vector formulas and, generally, does 
not provide unsat cores for unsatisfiable instances. 
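
The case distinction on the Boolean abstraction constant can be illustrated with the following heavily simplified Python sketch for a formula ∀x. body(x) over a small bit-width. It is an assumption-laden toy: the counterexample search is a brute-force enumeration instead of a solver call, and the function name, the returned tuples and the Skolem constant name "sk" are hypothetical.

```python
from typing import Callable, Dict, Optional, Tuple

Body = Callable[[int, Dict[str, int]], bool]


def quantifier_lemma(q_value: bool, body: Body, model: Dict[str, int],
                     width: int) -> Optional[Tuple[str, object]]:
    """Pick a refinement for 'forall x. body(x)' abstracted by constant q."""
    if q_value:
        # q is true: look for a value of x that falsifies the body under
        # the current model; the lemma would be  q -> body[x := v].
        for v in range(1 << width):
            if not body(v, model):
                return ("instantiate", v)
        return None  # the model already satisfies the quantified formula
    # q is false: the lemma would be  not q -> not body[x := sk].
    return ("skolemize", "sk")


body: Body = lambda x, m: (x & m["c"]) == 0
inst = quantifier_lemma(True, body, {"c": 4}, 3)
none = quantifier_lemma(True, body, {"c": 0}, 3)
skol = quantifier_lemma(False, body, {"c": 4}, 3)
```

With c = 4, the value x = 4 is the first counterexample and is used for instantiation, whereas with c = 0 the body holds for every x and no lemma is needed.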


4 Evaluation 


We evaluate the overall performance of Bitwuzla on all non-incremental and 
incremental benchmarks of all supported logics in SMT-LIB [5]. We further 
include logics with floating-point arithmetic that are classified as containing
linear real arithmetic (LRA). Bitwuzla does not support LRA reasoning, but
the benchmarks in these logics currently only involve to-floating-point conversion
(SMT-LIB command to_fp) from real values, which is supported.

Table 1. Solved instances and total runtime on solved instances (non-incremental).

Logic                   Boolector         Z3       cvc5       SC22   Bitwuzla
ABV (169)                       -         89         32          0          1
ABVFP (30)                      -         25         19          0         16
ABVFPLRA (75)                   -         47         36          0         31
AUFBV (1,522)                   -        403        486        597        983
AUFBVFP (57)                    -          7         21         24         39
BV (6,045)                  5,659      5,593      5,818      5,624      5,705
BVFP (205)                      -        176        171        148        188
BVFPLRA (209)                   -        189        107        140        199
FP (2,669)                      -      2,128      2,353      2,513      2,481
FPLRA (87)                      -         72         51         55         83
QF_ABV (15,084)            15,041     14,900     14,923     15,043     15,041
QF_ABVFP (18,129)               -     18,017     18,113     18,125     18,125
QF_ABVFPLRA (74)                -         69         74         34         74
QF_AUFBV (67)                  45         50         42         46         55
QF_AUFBVFP (1)                  -          1          1          1          1
QF_BV (42,472)             41,958     40,876     41,574     42,039     42,049
QF_BVFP (17,244)                -     17,229     17,238     17,242     17,241
QF_FP (40,409)                  -     40,303     40,357     40,368     40,358
QF_FPLRA (57)                   -         41         48         56         56
QF_UFBV (1,434)             1,403      1,404      1,387      1,413      1,411
QF_UFFP (2)                     -          2          2          2          2
UFBV (192)                      -        156        141        146        147
UFBVFP (2)                      -          1          1          1          1
Total (146,235)            64,106    141,778    142,995    143,617    144,287
Time (solved) [s]         417,643  1,212,584  1,000,466    563,832    580,435

We compare against Boolector [31] and the SMT-COMP 2022 version of 
Bitwuzla [26] (configuration SC22), which, at that time, was an improved and 
extended version of Boolector and won several divisions in all tracks of SMT- 
COMP 2022 [2]. Boolector did not participate in SMT-COMP 2022, thus we use 
the current version of Boolector available on GitHub (commit 13a8a06d) [1]. 
Further, since Boolector does not support logics involving floating-point arith- 
metic, quantified logics other than pure quantified bit-vectors and incremental 
solving when quantifiers are involved, we also compare against the SMT-COMP 
2022 versions of cvc5 [3] and Z3 [25]. Both solvers are widely used, high per- 
formance SMT solvers with support for a wide range of theories, including the 
theories supported by Bitwuzla. Note that this version of cvc5 uses a sequential 
portfolio of multiple configurations for some logics.

Table 2. Solved queries and total runtime on solved queries (incremental).

Logic                   Boolector         Z3       cvc5       SC22   Bitwuzla
ABVFPLRA (2,269)                -      2,220        818         55      2,269
BV (38,856)                     -     37,188     36,169     35,567     35,246
BVFP (458)                      -        458        458        274        458
BVFPLRA (5,597)                 -      5,507      2,964      3,144      4,797
QF_ABV (3,411)              3,238      2,866      2,746      3,242      2,939
QF_ABVFP (550,088)              -    515,714    534,629    550,034    550,041
QF_ABVFPLRA (1,876)             -         48      1,876      1,876      1,876
QF_AUFBV (967)                 23        860        320         23        956
QF_BV (53,684)             52,218     51,826     51,683     51,581     52,305
QF_BVFP (3,465)                 -      3,403      3,437      3,444      3,438
QF_BVFPLRA (32,736)             -     31,287     32,681     32,736     32,736
QF_FP (663)                     -        663        663        663        663
QF_FPLRA (48)                   -         48         48         48         48
QF_UFBV (5,492)             4,634      5,422      5,148      2,317      5,489
QF_UFFP (2)                     -          2          2          2          2
Total (699,612)            60,113    657,512    673,642    685,006    693,263
Time (solved) [s]         102,812  3,359,645  1,516,672    157,083    172,534

-: logic not supported by the solver.


We ran all experiments on a cluster with Intel Xeon E5-2620 v4 CPUs. We 
allocated one CPU core and 8GB of RAM for each solver and benchmark pair, 
and used a 1200s time limit, the same time limit as used in SMT-COMP
2022 [2]. 

Table 1 shows the number of solved benchmarks for each solver in the non- 
incremental quantifier-free (QF_) and quantified divisions. Overall, Bitwuzla 
solves the largest number of benchmarks in the quantified divisions, considerably
improving over SC22 and Boolector with over 600 and 4,200 more solved benchmarks,
respectively. Bitwuzla also takes the lead in the quantifier-free divisions, with 44
more solved instances than SC22, and more than 650 more solved benchmarks
than cvc5. On the 140,438 commonly solved instances between Bitwuzla,
SC22, cvc5, and Z3 over all divisions, Bitwuzla is the fastest solver with 203,838s, 
SC22 is slightly slower with 208,310s, cvc5 is 2.85x slower (586,105s), and Z3 is 
5.1x slower (1,049,534s). 
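
The per-division leads quoted above can be recomputed from the rows of Table 1. The following is a minimal sketch for that arithmetic only; the helper names `solved` and `lead` are illustrative and not part of any solver or benchmark tooling, and unsupported logics are assumed to count as zero solved instances.

```python
# Solved instances per logic from Table 1, in the column order
# (Boolector, Z3, cvc5, SC22, Bitwuzla); None marks an unsupported logic.
SOLVERS = ("Boolector", "Z3", "cvc5", "SC22", "Bitwuzla")
TABLE1 = {
    "ABV": (None, 89, 32, 0, 1),
    "ABVFP": (None, 25, 19, 0, 16),
    "ABVFPLRA": (None, 47, 36, 0, 31),
    "AUFBV": (None, 403, 486, 597, 983),
    "AUFBVFP": (None, 7, 21, 24, 39),
    "BV": (5659, 5593, 5818, 5624, 5705),
    "BVFP": (None, 176, 171, 148, 188),
    "BVFPLRA": (None, 189, 107, 140, 199),
    "FP": (None, 2128, 2353, 2513, 2481),
    "FPLRA": (None, 72, 51, 55, 83),
    "QF_ABV": (15041, 14900, 14923, 15043, 15041),
    "QF_ABVFP": (None, 18017, 18113, 18125, 18125),
    "QF_ABVFPLRA": (None, 69, 74, 34, 74),
    "QF_AUFBV": (45, 50, 42, 46, 55),
    "QF_AUFBVFP": (None, 1, 1, 1, 1),
    "QF_BV": (41958, 40876, 41574, 42039, 42049),
    "QF_BVFP": (None, 17229, 17238, 17242, 17241),
    "QF_FP": (None, 40303, 40357, 40368, 40358),
    "QF_FPLRA": (None, 41, 48, 56, 56),
    "QF_UFBV": (1403, 1404, 1387, 1413, 1411),
    "QF_UFFP": (None, 2, 2, 2, 2),
    "UFBV": (None, 156, 141, 146, 147),
    "UFBVFP": (None, 1, 1, 1, 1),
}


def solved(solver, quantifier_free):
    """Sum the solved instances of `solver` over the QF_ or the quantified logics."""
    col = SOLVERS.index(solver)
    return sum(row[col] or 0
               for logic, row in TABLE1.items()
               if logic.startswith("QF_") == quantifier_free)


def lead(solver, other, quantifier_free):
    """Number of additional instances `solver` solves compared to `other`."""
    return solved(solver, quantifier_free) - solved(other, quantifier_free)
```

For instance, `lead("Bitwuzla", "SC22", False)` gives the quantified-division lead over SC22, and `lead("Bitwuzla", "cvc5", True)` the quantifier-free lead over cvc5.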

Table 2 shows the number of solved incremental check-sat queries for each 
solver in the incremental divisions. Again, Bitwuzla solves the largest number of 
queries overall and in the quantifier-free divisions. For the quantified divisions, 
Bitwuzla solves 42,770 queries, the second largest number of solved queries after 
Z3 (45,373), and over 3,700 more queries than SC22 (39,040). On benchmarks
of the ABVFPLRA division, Bitwuzla significantly outperforms SC22 due
to the occurrence of nested arrays, which were unsupported in SC22. 

The artifact of this evaluation is archived and available in the Zenodo open- 
access repository at https://zenodo.org/record/7864687.


5 Conclusion


Our experimental evaluation shows that Bitwuzla is a state-of-the-art SMT 
solver for the quantified and quantifier-free theories of fixed-size bit-vectors, 
arrays, floating-point arithmetic, and uninterpreted functions. Bitwuzla has been 
extensively tested for robustness and correctness with Murxla [30], an API fuzzer 
for SMT solvers, which is an integral part of its development workflow. We have 
outlined several avenues for future work throughout the paper. We further plan 
to add support for the upcoming SMT-LIB version 3 standard, when finalized.


Abstract. Sequence theories are an extension of theories of strings with
an infinite alphabet of letters, together with a corresponding alphabet 
theory (e.g. linear integer arithmetic). Sequences are natural abstrac- 
tions of extendable arrays, which permit a wealth of operations including 
append, map, split, and concatenation. In spite of the growing amount 
of tool support for theories of sequences by leading SMT-solvers, little 
is known about the decidability of sequence theories, which is in stark 
contrast to the state of the theories of strings. We show that the decid- 
able theory of strings with concatenation and regular constraints can be 
extended to the world of sequences over an alphabet theory that forms a 
Boolean algebra, while preserving decidability. In particular, decidability 
holds when regular constraints are interpreted as parametric automata 
(which extend both symbolic automata and variable automata), but fails 
when interpreted as register automata (even over the alphabet theory of 
equality). When length constraints are added, the problem is Turing- 
equivalent to word equations with length (and regular) constraints. Sim- 
ilar investigations are conducted in the presence of symbolic transduc- 
ers, which naturally model sequence functions like map, split, filter, etc. 
We have developed a new sequence solver, SECO, based on parametric 
automata, and show its efficacy on two classes of benchmarks: (i) invari- 
ant checking on array-manipulating programs and parameterized sys- 
tems, and (ii) benchmarks on symbolic register automata. 


1 Introduction 


Sequences are an extension of strings, wherein elements might range over an infi- 
nite domain (e.g., integers, strings, and even sequences themselves). Sequences
are ubiquitous and commonly used data types in modern programming lan-
guages. They come under different names, e.g., Python/Haskell/Prolog lists, 
Java ArrayList (and to some extent Streams) and JavaScript arrays. Crucially, 
sequences are extendable, and a plethora of operations (including append, map, 
split, filter, concatenation, etc.) can naturally be defined and are supported by 
built-in library functions in most modern programming languages. 

Various techniques in software model checking [30], including symbolic
execution and invariant generation, require an appropriate SMT theory to which
verification conditions can be discharged. In the case of programs operating on
sequences, we consequently require an SMT theory of sequences, for which
leading SMT solvers like Z3 [6,38] and cvc5 [4] have provided some basic
support for over a decade. The basic design of sequence theories, as done in Z3 and
cvc5, as well as in other formalisms like symbolic automata [15], is in fact quite 
natural. That is, sequence theories can be thought of as extensions of theories of 
strings with an infinite alphabet of letters, together with a corresponding alpha- 
bet theory, e.g. Linear Integer Arithmetic (LIA) for reasoning about sequences of 
integers. Despite this, very little is known about what is decidable over theories 
of sequences. 

In the case of finite alphabets, sequence theories become theories over strings, 
in which a lot of progress has been made in the last few decades, barring the 
long-standing open problem of string equations with length constraints (e.g. see 
[26]). For example, it is known that the existential theory of concatenation over 
strings with regular constraints is decidable (in fact, PSPACE-complete), e.g., 
see [17,29,36,40,43]. Here, a regular constraint takes the form $x \in L(E)$, where
E is a regular expression, mandating that the expression E matches the string
represented by x. In addition, several natural syntactic restrictions, including
straight-line, acyclicity, and chain-free (e.g. [1,2,5,11,12,26,35]), have been
identified, with which string constraints remain decidable in the presence of more 
complex string functions (e.g. transducers, replace-all, reverse, etc.). In the case 
of infinite alphabets, only a handful of results are available. Furia [25] showed 
that the existential theory of sequence equations over the alphabet theory of 
LIA is decidable by a reduction to the existential theory of concatenation over 
strings (over a finite alphabet) without regular constraints. Loosely speaking, a 
number (e.g. 4) can be represented as a string in unary (e.g. 1111), and addition 
is then simulated by concatenation. Therefore, his decidability result does not 
extend to other data domains and alphabet theories. Wang et al. [45] define an 
extension of the array property fragment [9] with concatenation. This fragment 
imposes strong restrictions, however, on the equations between sequences (here 
called finite arrays) that can be considered. 


“Regular Constraints” Over Sequences. One answer of what a regular constraint 
is over sequences is provided by automata modulo theories. Automata modulo 
theories [15,16] are an elegant framework that can be used to capture the notion 
of regular constraints over sequences: Fix an alphabet theory T that forms a 
Boolean algebra; this is satisfied by virtually all existing SMT theories. In this 
framework, one uses formulas in T to capture multiple (possibly infinitely many)
transitions of an automaton. More precisely, between two states in a symbolic
automaton one associates a unary¹ formula $\varphi(x) \in T$. For example, $q \to_{\varphi} q'$
with $\varphi := x \equiv 0 \pmod 2$ over LIA corresponds to all transitions $q \to_i q'$ with
any even number i. Despite their nice properties, it is known that many sim- 
ple languages cannot be captured using symbolic automata; e.g., one cannot 
express the language consisting of sequences containing the same even number i 
throughout the sequence. 

There are essentially two (expressively incomparable) extensions of sym- 
bolic automata that address the aforementioned problem: (i) Symbolic Regis- 
ter Automata (SRA) [14] and (ii) Parametric Automata (PA) [21,23,24]. The 
model SRA was obtained by combining register automata [31] and symbolic 
automata. The model PA extends symbolic automata by allowing free variables 
(a.k.a. parameters) in the transition guards, i.e., the guard will be of the form 
$\varphi(x, p)$, for parameters p. In an accepting path of a PA, a parameter p used in
multiple transitions has to be instantiated with the same value, which enables
comparisons of different positions in an input sequence. For example, we can
assert that only sequences of the form $i^*$, for an even number i, are accepted by
the PA with a single transition $q \to_{\varphi} q$ with $\varphi(x, p) := x = p \land x \equiv 0 \pmod 2$
and q being the start and final state. PA can also be construed as an extension
of both variable automata [27] and symbolic automata. SRA and PA are not 
comparable: while parameters can be construed as read-only registers, SRA can 
only compare two different positions using equality, while PA may use a general 
formula in the theory in such a comparison (e.g., order). 


Contributions. The main contribution of this paper is to provide the first decid- 
able fragments of a theory of sequences parameterized in the element theory. 
In particular, we show how to leverage string solvers to solve theories over 
sequences. We believe this is especially interesting, in view of the plethora of 
existing string solvers developed in the last 10 years (e.g. see the survey [3]). 
This opens up new possibilities for verification tasks to be automated; in partic- 
ular, we show how verification conditions for Quicksort, as well as Bakery and 
Dijkstra protocols, can be captured in our sequence theory. This formalization 
was done in the style of regular model checking [8,34], whose extension to infinite 
alphabets has been a longstanding challenge in the field. We also provide a new 
(dedicated) sequence solver SECO. We detail our results below.

We first show that the quantifier-free theory of sequences with concatenation 
and PA as regular constraints is decidable. Assuming that the theory is solvable 
in PSPACE (which is reasonable for most SMT theories), we show that our algo- 
rithm runs in EXPSPACE (i.e., double-exponential time and exponential space). 
We also identify conditions on the SMT theory T under which PSPACE can be 
achieved and as an example show that Linear Real Arithmetic (LRA) satisfies 
those conditions. This matches the PSPACE-completeness of the theory of strings 
with concatenation and regular constraints [18]. 

¹ This can be generalized to any arity, which has to be set uniformly for the automaton.

We consider three different variants/extensions:


(i) Add length constraints. Length constraints (e.g., |x| = |y| for two sequence
variables x,y) are often considered in the context of string theories, but 
the decidability of the resulting theory (i.e., strings with concatenation and 
length constraints) is still a long-standing open problem [26]. We show that 
the case for sequences is Turing-equivalent to the string case. 

(ii) Use SRA instead of PA. We show that the resulting theory of sequences is 
undecidable, even over the alphabet theory T of equality. 

(iii) Add symbolic transducers. Symbolic transducers [15,16] extend finite-state 
input/output transducers in the same way that symbolic automata extend 
finite-state automata. To obtain decidability, we consider formulas satisfying 
the straight-line restriction that was defined over string theories [35]. We
show that the resulting theory is decidable in 2-EXPTIME and is EXPSPACE- 
hard, if T is solvable in PSPACE. 


We have implemented the solver SECO based on our algorithms, and demon- 
strated its efficacy on two classes of benchmarks: (i) invariant checking on 
array-manipulating programs and parameterized systems, and (ii) benchmarks 
on Symbolic Register Automata (SRA) from [14]. For (i), we model invariants
for QuickSort, Dijkstra’s Self-Stabilizing Protocol [20], and Lamport’s Bakery
Algorithm [33] as sequence constraints. For (ii), we solve
decision problems for SRA on benchmarks of [14] such as emptiness, equivalence 
and inclusion on regular expressions with back-references. We report promising 
experimental results: our solver SECO is up to three orders of magnitude faster 
than the SRA solver in [14]. 


Organization. We provide a motivating example of sequence theories in Sect. 2. 
Section 3 contains the syntax and semantics of the sequence constraint language, 
as well as some basic algorithmic results. We deal with equational and regular 
constraints in Sect. 4. In Sect. 5, we deal with the decidable fragments with equa- 
tional constraints, regular constraints, and transducers. We deal with extensions 
of these languages with length and SRA constraints in Sect. 6. In Sect. 7 we report 
our implementation and experimental results. We conclude in Sect. 8. Missing 
details and proofs can be found in the full version. 


2 Motivating Example 


We illustrate the use of sequence theories in verification using an implementation
of QuickSort [28], shown in Listing 1. The example uses the Java Streams API 
and resembles typical implementations of QuickSort in functional languages; the 
program uses high-level operations on streams and lists like filter and concatena- 
tion. As we show, the data types and operations can naturally be modelled using 
a theory of sequences over integer arithmetic, and our results imply decidability 
of checks that would be done by a verification system. 

The function quickSort processes a given list l by picking the first element
as the pivot p, then creating two sub-lists left, right in which all numbers
≥p (resp., <p) have been eliminated. The function quickSort is then recur-
sively invoked on the two sub-lists, and the results are finally concatenated and
returned.

    /*@
     * ensures \forall int i; \result.contains(i) == l.contains(i);
     */
    public static List<Integer> quickSort(List<Integer> l) {
      if (l.size() < 1) return l;
      Integer p = l.get(0);  // pivot: first element of the list
      // elements strictly smaller than the pivot
      List<Integer> left = l.stream().filter(i -> i < p)
                            .collect(Collectors.toList());
      // remaining elements (pivot itself skipped) that are >= the pivot
      List<Integer> right = l.stream().skip(1).filter(i -> i >= p)
                             .collect(Collectors.toList());
      List<Integer> result = quickSort(left);
      result.add(p); result.addAll(quickSort(right));
      return result;
    }

Listing 1. Implementation of QuickSort with Java Streams.

We focus on the verification of the post-condition shown in the beginning of 
Listing 1: sorting does not change the set of elements contained in the input list. 
This is a weaker form of the permutation property of sorting algorithms, and as 
such known to be challenging for verification methods (e.g., [42]). Sortedness of 
the result list can be stated and verified in a similar way, but is not considered 
here. Following the classical design-by-contract approach [37], to verify the par- 
tial correctness of the function it is enough to show that the post-condition is 
established in any top-level call of the function, assuming that the post-condition 
holds for all recursive calls. For the case of non-empty lists, the verification con- 
dition, expressed in our logic, is: 


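\[
\left(\begin{array}{l}
\mathit{left} = T_{<l_0}(l) \;\land\; \mathit{right} = T_{\geq l_0}(\mathit{skip}_1(l)) \;\land \\
\forall i.\,(i \in \mathit{left} \leftrightarrow i \in \mathit{left}') \;\land\; \forall i.\,(i \in \mathit{right} \leftrightarrow i \in \mathit{right}') \;\land \\
\mathit{res} = \mathit{left}' \,.\, [l_0] \,.\, \mathit{right}'
\end{array}\right)
\;\rightarrow\; \forall i.\,(i \in l \leftrightarrow i \in \mathit{res})
\]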


The variables l, res, left, right, left′, right′ range over sequences of integers,
while i is a bound integer variable. The formula uses several operators that a
useful sequence theory has to provide: (i) $l_0$: the first element of input list l;
(ii) $\in$ and $\notin$: membership and non-membership of an integer in a list, which
can be expressed using symbolic parametric automata; (iii) $\mathit{skip}_1$, $T_{<l_0}$, $T_{\geq l_0}$:
sequence-to-sequence functions, which can be represented using symbolic
parametric transducers; (iv) $\cdot\,.\,\cdot$: concatenation of several sequences. The formula
otherwise is a direct model of the method in Listing 1; the variables left′, right′ are
the results of the recursive calls, and concatenated to obtain the result sequence.
In addition, the formula contains quantifiers. To demonstrate validity of the
formula, it is enough to eliminate the last quantifier $\forall i$ by instantiating with a
Skolem symbol k, and then instantiate the other quantifiers (left of the implica-
tion) with the same k:


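\[
\left(\begin{array}{l}
\mathit{left} = T_{<l_0}(l) \;\land\; \mathit{right} = T_{\geq l_0}(\mathit{skip}_1(l)) \;\land \\
(k \in \mathit{left} \leftrightarrow k \in \mathit{left}') \;\land\; (k \in \mathit{right} \leftrightarrow k \in \mathit{right}') \;\land \\
\mathit{res} = \mathit{left}' \,.\, [l_0] \,.\, \mathit{right}'
\end{array}\right)
\;\rightarrow\; (k \in l \leftrightarrow k \in \mathit{res})
\]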


As one of the results of this paper, we prove that this final formula is in a 
decidable logic. The formula can be rewritten to a disjunction of straight-line 
formulas, and shown to be valid using the decision procedure presented in Sect. 5. 
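
As a sanity check of the instantiated verification condition (and not as a substitute for the decision procedure), one can test it exhaustively on small inputs. The sketch below assumes integer sequences modelled as Python tuples over a three-element alphabet and bounded lengths; the function names and bounds are illustrative choices. The equalities defining left, right and res are applied by substitution, so only the two membership premises remain as hypotheses.

```python
from itertools import product


def sequences(alphabet, max_len):
    """Yield all sequences (as tuples) over `alphabet` of length <= max_len."""
    for n in range(max_len + 1):
        yield from product(alphabet, repeat=n)


def vc_holds(l, left_rec, right_rec, k):
    """Evaluate the instantiated verification condition for concrete values.

    `left_rec` and `right_rec` play the role of left' and right', i.e. the
    (arbitrary) results of the recursive calls; `k` is the Skolem constant.
    """
    pivot = l[0]                                      # l_0
    left = tuple(i for i in l if i < pivot)           # T_{<l_0}(l)
    right = tuple(i for i in l[1:] if i >= pivot)     # T_{>=l_0}(skip_1(l))
    res = left_rec + (pivot,) + right_rec             # left' . [l_0] . right'
    premise = ((k in left) == (k in left_rec)
               and (k in right) == (k in right_rec))
    conclusion = (k in l) == (k in res)
    return (not premise) or conclusion


def check_bounded(alphabet=(0, 1, 2), max_len=3, max_rec_len=2):
    """Check the implication for all non-empty l and all left', right', k in bounds."""
    for l in sequences(alphabet, max_len):
        if not l:
            continue  # the verification condition covers non-empty lists only
        for left_rec in sequences(alphabet, max_rec_len):
            for right_rec in sequences(alphabet, max_rec_len):
                for k in alphabet:
                    if not vc_holds(l, left_rec, right_rec, k):
                        return False
    return True
```

A bounded check of this kind can only refute the formula; establishing validity for all sequences is what the decision procedure is for.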


3 Models 


In this section, we will define our sequence constraint language, and prove some 
basic results regarding various constraints in the language. The definition is a 
natural generalization of string constraints (e.g. see [12,17,26,29,35]) by employ- 
ing an alphabet theory (a.k.a. element theory), as is done in symbolic automata 
and automata modulo theories [15,16, 44]. 

For simplicity, our definitions will follow a model-theoretic approach. Let $\sigma$
be a vocabulary. We fix a $\sigma$-structure $\mathfrak{S} = (D; I)$, where D can be a finite or
an infinite set (i.e., the universe) and I maps each function/relation symbol in
$\sigma$ to a function/relation over D. The elements of our sequences will range over
D. We assume that the quantifier-free theory $T_{\mathfrak{S}}$ over $\mathfrak{S}$ (including equality)
is decidable. Examples of such $T_{\mathfrak{S}}$ abound in SMT, e.g., LRA and LIA.
We write T instead of $T_{\mathfrak{S}}$ when $\mathfrak{S}$ is clear. Our quantifier-free formulas will use
uninterpreted T-constants a, b, c, ..., and may also use variables x, y, z, .... (The
distinction between uninterpreted constants and variables is made only for the
purpose of presentation of sequence constraints, as will be clear shortly.) We use
C to denote the set of all uninterpreted T-constants. A formula $\varphi$ is satisfiable if
there is an assignment that maps the uninterpreted constants and variables to
concrete values in D such that the formula becomes true in $\mathfrak{S}$.

Next, we define how we lift T to sequence constraints, using T as the alphabet 
theory (a.k.a. element theory). As in the case of strings (over a finite alphabet), 
we use standard notation like D* to refer to the set of all sequences over D. By 
default, elements of D* are written as standard in mathematics, e.g., 7,8, 100, 
when D = Z. Sometimes we will disambiguate them by using brackets, e.g., 
(7,8,100) or [7,8,100]. We will use the symbol s (with/without subscript) to 
refer to concrete sequences (i.e., a member of D*). We will use x,y,z to refer 
to T-sequence variables. Let V denote the set of all T-sequence variables, and
$\Gamma := C \cup D$. We will define constraint languages syntactically at the beginning,
and will instantiate them to specific sequence operations. The theory T* of
T-sequences consists of the following constraints:

\[ \varphi ::= R(x_1, \ldots, x_r) \mid \varphi \land \varphi \]

where R is an r-ary relation symbol. In our definition of each atom R below, we
will specify if an assignment $\mu$, which maps each $x_i$ to a T-sequence and each
uninterpreted constant to a T-element, satisfies R. If $\mu$ satisfies all atoms, we
say that $\mu$ is a solution and the satisfiability problem is to decide whether there
is a solution for a given $\varphi$.

A few remarks about the missing boolean operators in the constraint lan- 
guage above are in order. Disjunctions can be handled easily using the DPLL(T) 
framework (e.g. see [32]), so we have kept our theory conjunctive. As in the case 
of strings, negations are usually handled separately because they can sometimes 
(but not in all cases) be eliminated while preserving decidability. 


Equational Constraints. A T-sequence equation is of the form

\[ L = R \]

where each of L and R is a concatenation of concrete T-elements, uninterpreted
constants, and T-sequence variables. That is, if $\Theta := \Gamma \cup V$, then $L, R \in \Theta^*$.
For example, in the equation

\[ 0.1.x = x.0.1 \]

the set of all solutions is of the form $x \in (01)^*$. To make this more formal, we
extend each assignment $\mu$ to a homomorphism on $\Theta^*$. We write $\mu \models L = R$ if
$\mu(L) = \mu(R)$. Notice that this definition is just a direct extension of that of word
equations (e.g. see [17]), i.e., when the domain D is finite.
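
The example equation can be explored by brute force. The following minimal sketch enumerates candidate values for x up to a length bound and keeps those satisfying 0.1.x = x.0.1; sequences are modelled as Python tuples, and the restriction to the two letters 0 and 1 and the function name `solutions` are assumptions made for illustration.

```python
from itertools import product


def solutions(max_len, alphabet=(0, 1)):
    """Return all x with len(x) <= max_len such that 0.1.x = x.0.1."""
    found = []
    for n in range(max_len + 1):
        for x in product(alphabet, repeat=n):
            # Concatenation of sequences is tuple concatenation.
            if (0, 1) + x == x + (0, 1):
                found.append(x)
    return found
```

Up to length 6, the enumeration finds exactly the empty sequence and the sequences 01, 0101, and 010101, in line with the solution set $(01)^*$.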

In most cases the inequality constraints $L \neq R$ can be reduced to equalities; in
our case this also requires element constraints, described below.


Element Constraints. We allow T-formulas to constrain the uninterpreted con-
stants. More precisely, given a T-sentence (i.e., no free variables) $\varphi$ that uses C
as uninterpreted constants, we obtain a proposition P (i.e., 0-ary relation) such that
$\mu \models P$ iff $T \models \mu(\varphi)$.

Negations in the equational constraints can be removed just like in the case of
strings, i.e., by means of additional variables/constants and element constraints.
For example, $x \neq y$ can be replaced by
$(x = z a x' \land y = z b y' \land a \neq b) \lor x = y a z \lor x a z = y$.
Notice that $a \neq b$ is a T-formula because we assume the equality
symbol in T.


Regular Constraints. Over strings, regular constraints are simply unary con- 
straints U(x), where U is an automaton. The interpretation is that x is in the language
of U. We define an analogue of regular constraints over sequences using parametric
automata [21,23,24], which generalize both symbolic automata [15,16] and
variable automata [27].

A parametric automaton (PA) over T is of the form $\mathcal{A} = (\mathcal{X}, Q, \Delta, q_0, F)$,
where $\mathcal{X}$ is a finite set of parameters, Q is a finite set of control states, $q_0 \in Q$ is
the initial state, $F \subseteq Q$ is the set of final states, and $\Delta \subseteq_{\mathrm{fin}} Q \times T(\mathit{curr}, \mathcal{X}) \times Q$.
Here, parameters are simply uninterpreted T-constants, i.e., $\mathcal{X} \subseteq C$. Formulas
that appear in transitions in $\Delta$ will be referred to as guards, since they restrict
which transitions are enabled at a given state. Note that curr is an uninterpreted
constant that refers to the “current” position in the sequence. The semantics is
quite simply defined: a sequence $(d_1, d_2, \ldots, d_n)$ is in the language of $\mathcal{A}$ under
the assignment of parameters $\mu$, written as $(d_1, \ldots, d_n) \in L_\mu(\mathcal{A})$, when there is
a sequence of $\Delta$-transitions

\[ (q_0, \varphi_1(\mathit{curr}, \mathcal{X}), q_1),\ (q_1, \varphi_2(\mathit{curr}, \mathcal{X}), q_2),\ \ldots,\ (q_{n-1}, \varphi_n(\mathit{curr}, \mathcal{X}), q_n), \]

such that $q_n \in F$ and $T \models \varphi_i(d_i, \mu(\mathcal{X}))$ for each i. Finally, a regular constraint $\mathcal{A}(x)$ is
satisfied by $\mu$ when $\mu(x) \in L_\mu(\mathcal{A})$.
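
The semantics can be illustrated with a small membership test. This is a minimal sketch under simplifying assumptions: guards are given as Python predicates `guard(curr, params)` rather than T-formulas, the parameter assignment is a dictionary, and the names `accepts` and `EVEN_CONST` are illustrative. The example automaton is the single-state PA from the introduction, whose guard requires the current element to equal the parameter p and to be even.

```python
def accepts(transitions, initial, finals, params, word):
    """Check whether `word` is in the language of the PA under `params`.

    `transitions` is a list of (source, guard, target) triples, where
    guard(curr, params) decides whether the transition is enabled for the
    current element. The automaton may be nondeterministic, so a set of
    reachable states is tracked.
    """
    states = {initial}
    for d in word:
        states = {dst for (src, guard, dst) in transitions
                  if src in states and guard(d, params)}
    return bool(states & set(finals))


# Single state q with a loop guarded by: curr = p and curr is even.
EVEN_CONST = [("q", lambda curr, mu: curr == mu["p"] and curr % 2 == 0, "q")]
```

Under the assignment p = 4 the sequence (4, 4, 4) is accepted while (4, 2) is not, since the parameter must take the same value at every position.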

Note that it is possible to complement a PA $\mathcal{A}$, but one has to be careful with the
semantics: we treat $\mathcal{A}$ as a symbolic automaton, and symbolic automata are closed under
Boolean operations [15]. So we are looking for $\mu$ such that $\mu(x)$ is accepted by the
complemented automaton under $\mu$, i.e., $\mu(x) \notin L_\mu(\mathcal{A})$. What we cannot
do using the complementation is a universal quantification over the parameters;
note that already the theory of strings with universal and existential quantifiers is
undecidable.

We state next a lemma showing that PAs using only “local” parameters, 
together with equational constraints, can encode the constraint language that 
we have defined so far. 


Lemma 1. Satisfiability of sequence constraints with equation, element, and reg- 
ular constraints can be reduced in polynomial-time to satisfiability of sequence 
constraints with equation and regular constraints (i.e., without element con- 
straints). Furthermore, it can be assumed that no two regular constraints share 
any parameter. 


Proposition 1. Assume that T is solvable in NP (resp. PSPACE). Then, decid- 
ing nonemptiness of a parametric automaton over T is in NP (resp. PSPACE). 


The proof is standard (e.g. see [21,23,24]), and only sketched here. The algorithm
first nondeterministically guesses a simple path in the automaton $\mathcal{A}$ from the
initial state $q_0$ to some final state $q_F$. Let us say that the guards appearing
in this path are $\psi_1(\mathit{curr}, \mathcal{X}), \ldots, \psi_k(\mathit{curr}, \mathcal{X})$. We need to check if this path is
realizable by checking T-satisfiability of

\[ \exists \mathcal{X}.\ \bigwedge_{i=1}^{k} \exists \mathit{curr}.\ \psi_i(\mathit{curr}, \mathcal{X}). \]

It is easy to see that this is an NP (resp. NPSPACE = PSPACE) procedure.
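
The proof sketch can be mirrored by a deterministic search. The following is a minimal sketch under a strong simplifying assumption: instead of calling a T-solver, the existential quantifiers are checked by enumerating a finite list of candidate values, which is incomplete in general and serves only as an illustration. Guards are Python predicates `guard(curr, params)`, and the names `simple_paths` and `nonempty` are illustrative.

```python
from itertools import product


def simple_paths(transitions, initial, finals):
    """Yield the list of guards along each simple path from initial to a final state."""
    def dfs(state, visited, guards):
        if state in finals:
            yield list(guards)
        for (src, guard, dst) in transitions:
            if src == state and dst not in visited:
                yield from dfs(dst, visited | {dst}, guards + [guard])
    yield from dfs(initial, {initial}, [])


def nonempty(transitions, initial, finals, param_names, candidates):
    """Search for a simple path and parameter values making every guard satisfiable.

    The formula checked per path has the shape
    exists params. AND_i exists curr. guard_i(curr, params),
    with both quantifiers restricted to `candidates` (an assumption).
    """
    for guards in simple_paths(transitions, initial, finals):
        for values in product(candidates, repeat=len(param_names)):
            mu = dict(zip(param_names, values))
            if all(any(g(curr, mu) for curr in candidates) for g in guards):
                return True
    return False


# First step reads an even p; second step would need p + 1 to be even as well.
A_EMPTY = [("q0", lambda c, m: c == m["p"] and c % 2 == 0, "q1"),
           ("q1", lambda c, m: c == m["p"] + 1 and c % 2 == 0, "q2")]
# Same shape, but the second step reads p + 2, which is even whenever p is.
A_NONEMPTY = [("q0", lambda c, m: c == m["p"] and c % 2 == 0, "q1"),
              ("q1", lambda c, m: c == m["p"] + 2 and c % 2 == 0, "q2")]
```

Note that each guard gets its own witness for curr, while the parameter values are shared along the whole path, exactly as in the quantifier structure of the formula.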


Parametric Transducers. We define a suitable extension of symbolic transducers
over parameters following the definition from Veanes et al. [44]. A transducer
constraint is of the form $y = \mathcal{T}(x)$, for a parametric transducer $\mathcal{T}$. A parametric
transducer over T is of the form $\mathcal{T} = (\mathcal{X}, Q, \Delta, q_0, F)$, where $\mathcal{X}$, Q, $q_0$, F are
just like in parametric automata. Unlike parametric automata, $\Delta$ is a finite set
of tuples of the form $(p, (\varphi, w), q)$, where $(p, \varphi, q)$ is a standard transition in a
parametric automaton, and w is a (possibly empty) sequence of T-terms over
variable curr and constants $\mathcal{X}$, e.g., $w = (\mathit{curr}+7, \mathit{curr}+2)$. One can think of w
as the output produced by the transition. Given an assignment $\mu$ of parameters
and the sequence variables, the constraint $y = \mathcal{T}(x)$ is satisfied when there is a
sequence of $\Delta$-transitions

\[ (q_0, \varphi_1(\mathit{curr}, \mathcal{X}), w_1, q_1),\ (q_1, \varphi_2(\mathit{curr}, \mathcal{X}), w_2, q_2),\ \ldots,\ (q_{n-1}, \varphi_n(\mathit{curr}, \mathcal{X}), w_n, q_n), \]

such that $q_n \in F$ and $T \models \varphi_i(d_i, \mu(\mathcal{X}))$, where $\mu(x) = (d_1, \ldots, d_n)$, and finally

\[ \mu(y) = \mu_1(w_1) \cdots \mu_n(w_n) \]

where $\mu_i$ is $\mu$ but maps curr to $d_i$. The definition assumes that $\mu_i$ is extended
to terms and concatenation thereof by homomorphism, e.g., in LRA, if $w_1 =
(\mathit{curr}+7, \mathit{curr}+2)$ and $\mu_1$ maps curr to 10, then $w_1$ will get mapped to 17, 12.
Given a set $S \subseteq D^*$ and an assignment $\mu$ (mapping the constants to D), we define
the pre-image $\mathcal{T}_\mu^{-1}(S)$ of S under $\mathcal{T}$ with respect to $\mu$ as the set of sequences
$w \in D^*$ such that $w' = \mathcal{T}(w)$ holds with respect to $\mu$ for some $w' \in S$.
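
The transducer semantics can be made concrete with a small interpreter. This is a minimal sketch assuming that guards and output terms are Python functions of the current element and the parameter assignment; the names `transduce`, `FILTER_LT` and `SHIFT` are illustrative. Since transducers may be nondeterministic, the function returns the set of all possible outputs for a given input.

```python
def transduce(transitions, initial, finals, params, word):
    """Return all outputs y such that y = T(word) holds under `params`.

    `transitions` is a list of (source, guard, output, target) tuples, where
    guard(curr, params) is a predicate and `output` is a tuple of functions
    term(curr, params), each producing one output element.
    """
    configs = {(initial, ())}  # pairs (state, output produced so far)
    for d in word:
        successors = set()
        for (state, out) in configs:
            for (src, guard, output, dst) in transitions:
                if src == state and guard(d, params):
                    produced = tuple(term(d, params) for term in output)
                    successors.add((dst, out + produced))
        configs = successors
    return {out for (state, out) in configs if state in finals}


# Filter keeping the elements smaller than the parameter p.
FILTER_LT = [
    ("q", lambda c, m: c < m["p"], (lambda c, m: c,), "q"),   # copy curr
    ("q", lambda c, m: c >= m["p"], (), "q"),                  # drop curr
]

# One transition with output w = (curr + 7, curr + 2), as in the text.
SHIFT = [("q", lambda c, m: True,
          (lambda c, m: c + 7, lambda c, m: c + 2), "q")]
```

With p = 3, `FILTER_LT` maps (5, 1, 3, 2) to (1, 2); `SHIFT` maps the one-element sequence (10) to (17, 12), matching the example above.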


4 Solving Equational and Regular Constraints 


Here we present results on solving equational constraints, together with regular 
constraints, by a reduction to the string case, for which a wealth of results are 
already available. In general, this reduction causes an exponential blow-up in the 
resulting string constraint, which we show to be unavoidable in general. That 
said, we also provide a more refined analysis in the case when the underlying 
theory is LRA, where we can avoid this exponential blow-up. 


Prelude: The Case of Strings. We start with some known results about the 
case of strings. The satisfiability of word equations with regular constraints is 
PSPACE-complete [18,19]. This upper bound can be extended to the full quantifier-
free theory [10]. When no regular constraints are given, the problem is only
known to be NP-hard, and it is widely believed to be in NP. In the absence of
regular constraints, without loss of generality $\Gamma$ can be assumed to contain only
letters from the equations; this is not the case in the presence of regular constraints.
The algorithm solving word equations [19] does not need explicit access to
$\Gamma$: it is enough to know whether there is a letter which labels a given set of
transitions in the NFAs used in the regular constraints. In principle, there could
be exponentially many different (i.e., inducing different transitions in the NFAs)
letters. When oracle access to such an alphabet is provided, the satisfiability can still
be decided in PSPACE: while not explicitly claimed, this is exactly the scenario
in [19, Sect. 5.2].

Other constraints are also considered for word equations; perhaps the most
widely known are the length constraints, which are of the form $\sum_{x \in V} a_x \cdot |x| \leq c$,
where $\{a_x\}_{x \in V}$, c are integer constants and |x| denotes the length $|\mu(x)|$, with an
obvious semantics. It is an open problem whether word equations with length
constraints are decidable, see [26].


Reduction to Word Equations. We assume Lemma 1, i.e. that the parame-
ters used for different automata-based constraints are pairwise different. In par-
ticular, when looking for a satisfying assignment $\mu$ we can first fix an assignment
for $\mathcal{X}$ and then try to extend it to V. To avoid confusion, we call this partial
assignment $\pi : \mathcal{X} \to D$.

Consider the set $\Phi$ of all atoms in all guards in the regular constraints together
with the set of formulas $\{x = c\}$ over all constants $c \in D$ that appear in the
equational constraints, and the negations of both types of formulas. Fix an assignment
$\pi : \mathcal{X} \to D$. The type $\mathrm{type}_\pi(a)$ of $a \in D$ (under assignment $\pi$) is the set of formulas
in $\Phi$ satisfied by a, i.e. $\{\varphi \in \Phi : \varphi(\pi(\mathcal{X}), a) \text{ holds}\}$. Clearly there are at most
exponentially many different types (for a fixed $\pi$). A type t is realizable (for $\pi$)
when $t = \mathrm{type}_\pi(a)$ for some $a \in D$; we then say that t is realized by a.

If the constraints are satisfiable (for some parameter assignment $\pi$) then they
are satisfiable over a subset $D_\pi \subseteq_{\mathrm{fin}} D$, in the sense that we assign to uninterpreted
constants elements of $D_\pi$ and to T-sequence variables elements of $D_\pi^*$, where $D_\pi$
is created by taking (arbitrarily) one element of each realizable type. Note that for
each constant c in the equational constraints there is a formula “x = c” in $\Phi$; in
particular $\mathrm{type}_\pi(c)$ is realizable (only by c) and so $c \in D_\pi$.


Lemma 2. Given a system of constraints and a parameter assignment $\pi$, let
$D_\pi \subseteq D$ be obtained by choosing (arbitrarily) for each realizable type a single
element of this type. Then the set of constraints is satisfiable (for $\pi$) over $D$ if
and only if it is satisfiable (for $\pi$) over $D_\pi$. To be more precise, there is a
letter-to-letter homomorphism $\psi : D^* \to D_\pi^*$ such that if $\mu$ is a solution of a
system of constraints then $\psi \circ \mu$ is also a solution.


The proof can be found in the full version; its intuition is clear: we map each
letter $a \in D$ to the unique letter in $D_\pi$ of the same type.

Once the assignment is fixed (to $\pi$) and the domain is restricted to a finite set
($D_\pi$), the equational and regular constraints reduce to word equations with regular
constraints: treat $D_\pi$ as a finite alphabet and, for a parametric automaton
$A = (\mathcal{X}, Q, \Delta, q_0, F)$, create an NFA $A' = (D_\pi, Q, \Delta', q_0, F)$, i.e.
over the alphabet $D_\pi$, with the same set of states $Q$, the same starting state $q_0$ and
accepting states $F$, and the transition relation defined by $(q, a, q') \in \Delta'$ if and
only if there is $(q, \varphi(\mathit{curr}, \mathcal{X}), q') \in \Delta$ such that
$\varphi(a, \pi(\mathcal{X}))$ holds, i.e. we can move from $q$ to $q'$ by $a$ in $A'$ if and
only if we can make this move in $A$ under assignment $\pi$. Clearly, from the construction:


Lemma 3. Given an assignment of parameters $\pi$, let $D_\pi$ be a set from Lemma 2,
$A$ be a parametric automaton and $A'$ the automaton as constructed above. Then

$$L(A) \cap D_\pi^* = L(A') .$$


We can rewrite the parametric automata constraints as regular constraints
and treat equational constraints as word equations (over the finite alphabet $D_\pi$).
From Lemma 2 and Lemma 3 it follows that the original constraints have a
solution for assignment $\pi$ if and only if the constructed system of constraints
has a solution. Therefore, once the appropriate assignment $\pi$ is fixed, the validity
of constraints can be verified [19]. It turns out that we do not need the actual
$\pi$; it is enough to know which types are realizable for it, which translates to an
exponential-size formula. We will use the letter $\tau$ to denote a subset of $2^\Phi$; the
idea is that $\tau = \{\mathrm{type}_\pi(a) : a \in D\} \subseteq 2^\Phi$ and if different
$\pi, \pi'$ give the same sets of realizable types, then they both yield a satisfying
assignment or both do not. Hence it is enough to focus on $\tau$ and not on the actual $\pi$.


Lemma 4. Given a system of equational and regular constraints we can
non-deterministically reduce them to a formula of the form

$$\exists_{t \in \tau}\, a_t \in D.\ \exists \mathcal{X} \in D^{|\mathcal{X}|}.\ \bigwedge_{t \in \tau} \bigwedge_{\varphi \in t} \varphi(\mathcal{X}, a_t) , \qquad (1)$$

where $\tau \subseteq 2^\Phi$ is of at most exponential size, and a system of word equations
with regular constraints of linear size and over a $|\tau|$-size alphabet, using auxiliary
$O(n|\tau|)$ space. The solutions of the latter word equations (for which also (1) holds)
are solutions of the original system, by appropriate identification of symbols.


Proof. We guess the set $\tau$ of types of the assignment of parameters $\pi$, i.e.
$\tau = \{\mathrm{type}_\pi(a) : a \in D\}$, such that there is an assignment $\mu$ extending
$\pi$; note that as $\Phi$ has linearly many atoms and $\tau \subseteq 2^\Phi$, $|\tau|$ may be
of exponential size in general. Formula (1) verifies the guess: we validate whether there
are values of $\mathcal{X}$ such that for each type $t \in \tau$ there is a value $a_t$ such
that $\mathrm{type}_\pi(a_t) = t$.

Let $D_\pi$ be a set having one symbol per every type in $\tau$, as in Lemma 2; note
that this includes all constants in the equational constraints. The algorithm will
not have access to particular values; instead we store each $t \in \tau$, say as a bitvector
describing which atoms in $\Phi$ this letter satisfies. In particular, $|D_\pi| = |\tau|$ and
it is at most exponential. In the following we will consider only solutions over $D_\pi$.

For each $a \in D_\pi$ we can validate which transitions in $A$ it can take: a
transition is labelled by a guard which is a conjunction of atoms from $\Phi$, and
each such atom either is in $\mathrm{type}_\pi(a)$ or is not. Hence we can treat $A$ as an
NFA over $D_\pi$. We need neither construct nor store it, we can use $A$: when we want to
make a transition by $\varphi(\mathcal{X}, a)$ we look up whether each atom of $\varphi$ is in
$\mathrm{type}_\pi(a)$ or not. Similarly, the constraint $A(x)$ is restricted to $x \in L(A)$
and for $x \in D_\pi^*$ this is a usual regular constraint.
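
The look-up described above can be sketched in Python as follows. This is only an illustration under stated assumptions: letters are represented directly by their types (sets of atom names), a guard is a conjunction of literals (an atom name plus the required truth value), and the example automaton and the atom name `curr=k` are hypothetical.

```python
from typing import FrozenSet, List, Set, Tuple

Type = FrozenSet[str]                 # atoms of Phi that hold for a letter
Literal = Tuple[str, bool]            # (atom name, required truth value)
Guard = List[Literal]                 # conjunction of literals
Transition = Tuple[str, Guard, str]   # (source state, guard, target state)

def enabled(guard: Guard, letter: Type) -> bool:
    """A guard holds for a letter iff every literal agrees with the letter's type."""
    return all((atom in letter) == positive for atom, positive in guard)

def accepts(delta: List[Transition], start: str, finals: Set[str],
            word: List[Type]) -> bool:
    """Run the automaton on the fly; the explicit NFA is never built."""
    current = {start}
    for letter in word:
        current = {q2 for (q1, guard, q2) in delta
                   if q1 in current and enabled(guard, letter)}
    return bool(current & finals)

# Hypothetical automaton accepting words with some letter equal to k.
delta = [("q0", [], "q0"), ("q0", [("curr=k", True)], "q1"), ("q1", [], "q1")]
is_k, not_k = frozenset({"curr=k"}), frozenset()
```

Only the types are stored, so the memory needed per letter is one bit per atom, in line with the bitvector representation mentioned in the proof.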

We treat equational constraints as word equations over the alphabet $D_\pi$.

Concerning the correctness of the reduction: if the system of word equations
(with regular constraints) is satisfiable and the formula (1) is also satisfiable,
then there is a satisfying assignment $\mu$ over $D_\pi$ and $D_\pi^*$; in particular, there
is an assignment of parameters for which there are letters of the given types (note that
in principle it could be that $\mu$ induces more types, i.e. there is a value $a$ such that
$\mathrm{type}_\mu(a) \notin \tau$ and so it is not represented in $D_\pi$, but this is fine:
enlarging the alphabet cannot invalidate a solution), i.e. the transitions for $a_t$ in the
automata after the reduction are the same as in the corresponding parametric automata
for the assignment $\pi$; this is guaranteed by the satisfiability of (1) and the way
we construct the instance, see Lemma 3.

On the other hand, when there is a solution of the input constraints, there is
one for some assignment of parameters $\pi$. Hence, by Lemma 2, there is a solution
over $D_\pi$. The algorithm guesses $\tau = \{\mathrm{type}_\pi(a) : a \in D\}$ and (1) is true
for it. Then by Lemma 2 there is a solution over $D_\pi$ as constructed in the reduction,
and by Lemma 3 the regular constraints define the same subsets of $D_\pi^*$ both
when interpreted as parametric automata and as NFAs.


Theorem 1. If theory $T$ is in PSPACE then sequence constraints are in
EXPSPACE.

If $\tau$ is of polynomial size and the formula (1) can be verified in PSPACE, then
sequence constraints can be verified in PSPACE.


One of the difficulties in deciding sequence constraints using the word equations
approach is the size of the set of realizable types $\tau$, which could be exponential.
For some concrete theories it is known to be smaller and thus a better upper
bound on complexity follows. For instance, it is easy to show that for LRA there
are linearly many realizable types, which implies a PSPACE upper bound.


Corollary 1. Sequence constraints for Linear Real Arithmetic are in PSPACE. 


In general, the EXPSPACE upper bound from Theorem 1 cannot be improved, 
as even non-emptiness of intersection of parametric automata is EXPSPACE- 
complete for some theories decidable in PSPACE. This is in contrast to the case 
of symbolic automata, for which the non-emptiness of intersection (for a theory 
T decidable in PSPACE) is in PSPACE. This shows the importance of parameters 
in our lower bound proof. 


Theorem 2. There are theories with existential fragment decidable in PSPACE 
and whose non-emptiness of intersection of parametric automata is EXPSPACE- 
complete. 


When no regular constraints are allowed, we can solve the equational and 
element constraints in PSPACE (note that we do not use Lemma 1). 


Theorem 3. For a theory T decidable in PSPACE, the element and equational 
constraints (so no regular constraints) can be decided in PSPACE. 


5 Algorithm for Straight-Line Formulas 


It is known that adding finite transducers into word equations results in an 
undecidable model (e.g. see [35]). Therefore, we extend the straight-line restric- 
tion [12,35] to sequences, and show that it suffices to recover decidability for 
equational constraints, together with regular and transducer constraints. In fact, 
we will show that the straight-line fragment is decidable in
doubly exponential time and is EXPSPACE-hard, if T is solvable in PSPACE. It
has been observed that the straight-line fragment for the theory of strings already 
covers many interesting benchmarks [12,35], and similarly many properties of 
sequence-manipulating programs can be proven using the fragment, including 
the QuickSort example from Sect. 2 and other benchmarks shown in Sect. 7.

The Straight-Line Fragment SL. We start by defining recognizable formulas
over sequences, followed by the syntactic and semantic restrictions on our
constraint language. This definition follows closely the definition of recogniz- 
able relations over finite alphabets, except that we replace finite automata with 
parametric automata. 


Definition 1 (Recognizable formula). A formula $R(x_1, \ldots, x_r)$ is recognizable
if it is equivalent to a positive Boolean combination of regular constraints.


Note that this is simply a generalization of regular constraints to multiple variables;
i.e., a 1-ary recognizable formula can be turned into a regular constraint,
since regular constraints are closed under intersection and union.

To define the straight-line fragment, we use the approach of [12]; that is, 
the fragment is defined in terms of “feasibility of a symbolic execution”. Here, 
a symbolic execution is just a sequence of assignments and assertions, whereas 
the feasibility problem amounts to deciding whether there are concrete values 
of the variables so that the symbolic execution can be run and none of the 
assertions are violated. We now make this intuition formal. A symbolic execution 
is syntactically generated by the following grammar: 


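$$S ::= y := f(x_1, \ldots, x_r, \mathcal{X}) \mid \mathbf{assert}(R(x_1, \ldots, x_r)) \mid \mathbf{assert}(\varphi) \mid S; S \qquad (2)$$


where $f : (D^*)^r \times D^{|\mathcal{X}|} \to D^*$ is a function, $R$ are recognizable
formulas, and $\varphi$ are element constraints.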

The symbolic execution $S$ can be turned into a sequence constraint as follows.
Firstly, we can turn $S$ into the standard Static Single Assignment (SSA) form
by introducing new variables on the left-hand side of assignments.
For example, $y := f(x); y := g(z)$ becomes $y := f(x); y' := g(z)$. Then, in the
resulting constraint, each variable appears at most once on the left-hand side
of an assignment. That way, we can simply replace each assignment symbol $:=$
with an equality symbol $=$. We then treat each sequential composition as the
conjunction operator $\wedge$ and each assertion as a conjunct. Note that individual
assertions are already sequence constraints. Next, we define how an interpretation $\mu$
satisfies the constraint $y = f(x_1, \ldots, x_r, \mathcal{X})$:

$$\mu \models y = f(x_1, \ldots, x_r, \mathcal{X}) \quad \text{if} \quad \mu(y) = f(\mu(x_1), \ldots, \mu(x_r), \mu(\mathcal{X})).$$

Note that '=' on the l.h.s. is syntactic, while '=' on the r.h.s. is in the
metalanguage. The definition of the semantics of the language is now inherited
from Sect. 3.
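
The translation from a symbolic execution to a constraint can be sketched in Python. The tuple encoding of statements and the helper names `to_ssa` and `as_constraint` are assumptions made for this illustration only; the sketch also assumes that primed names such as `y'` do not already occur in the input.

```python
def to_ssa(program):
    """Rename assigned variables so each is assigned at most once.

    Statements are ("assign", target, function, args) or ("assert", relation, args).
    """
    current = {}   # source variable -> its current SSA name
    used = set()   # every SSA name that has been read or assigned so far

    def read(v):
        name = current.get(v, v)
        used.add(name)
        return name

    def fresh(v):
        name = v
        while name in used:       # add primes until the name is unused
            name += "'"
        used.add(name)
        return name

    out = []
    for stmt in program:
        if stmt[0] == "assign":
            _, target, fn, args = stmt
            new_args = tuple(read(a) for a in args)   # arguments use the old names
            current[target] = fresh(target)
            out.append(("assign", current[target], fn, new_args))
        else:
            _, rel, args = stmt
            out.append(("assert", rel, tuple(read(a) for a in args)))
    return out

def as_constraint(ssa):
    """Replace := by = and sequential composition by conjunction."""
    parts = []
    for stmt in ssa:
        if stmt[0] == "assign":
            _, target, fn, args = stmt
            parts.append(f"{target} = {fn}({', '.join(args)})")
        else:
            _, rel, args = stmt
            parts.append(f"{rel}({', '.join(args)})")
    return " ∧ ".join(parts)

prog = [("assign", "y", "f", ("x",)), ("assign", "y", "g", ("z",))]
```

On the example from the text, `as_constraint(to_ssa(prog))` yields the conjunction of `y = f(x)` and `y' = g(z)`.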

In addition to the syntactic restrictions, we also need a semantic condition: 
in our language, we only permit functions f such that the pre-image of each 
regular constraint under f is effectively a recognizable formula: 


(RegInvRel) A function $f$ is permitted if for each regular constraint $A(y)$, it is
possible to compute a recognizable formula that is equivalent to the formula

$$\exists y : A(y) \wedge y = f(x_1, \ldots, x_r, \mathcal{X}).$$

Two functions satisfying (RegInvRel) are the concatenation function x := y.z
(here y could be the same as z) and parametric transducers y := T(x). We will 
only use these two functions in the paper, but the result is generalizable to other 
functions. 


Proposition 2. Given a regular constraint $A(y)$ and a constraint $y = x.z$, we
can compute a recognizable formula $\psi(x, z)$ equivalent to $\exists y : A(y) \wedge y = x.z$.
Furthermore, this can be achieved in polynomial time.


The proof of this proposition is exactly the same as in the case of strings, e.g., 
see [12,35]. 
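
For intuition, the string-case construction can be sketched in Python for an ordinary NFA over a finite alphabet: the pre-image of $A(y)$ under $y = x.z$ is a disjunction, over all states $q$, of "$x$ leads from the start state to $q$" and "$z$ leads from $q$ to an accepting state". The class `NFA` and the helpers below are illustrative assumptions, not the paper's implementation, and guards over parameters are left out.

```python
class NFA:
    def __init__(self, states, delta, start, finals):
        self.states = set(states)
        self.delta = delta            # dict: (state, letter) -> set of states
        self.start = start
        self.finals = set(finals)

    def restrict(self, start, finals):
        """Same transitions, but a new start state and new accepting states."""
        return NFA(self.states, self.delta, start, finals)

    def accepts(self, word):
        current = {self.start}
        for letter in word:
            current = {q2 for q in current
                       for q2 in self.delta.get((q, letter), set())}
        return bool(current & self.finals)

def concat_preimage(aut):
    """One disjunct (A_x, A_z) per intermediate state q of the automaton."""
    return [(aut.restrict(aut.start, {q}), aut.restrict(q, aut.finals))
            for q in sorted(aut.states)]

def holds(disjuncts, x, z):
    """Evaluate the recognizable formula on concrete words x and z."""
    return any(ax.accepts(x) and az.accepts(z) for ax, az in disjuncts)

# Example automaton over {a, b}: accepts words containing at least one b.
aut = NFA({0, 1}, {(0, "a"): {0}, (0, "b"): {1},
                   (1, "a"): {1}, (1, "b"): {1}}, 0, {1})
pre = concat_preimage(aut)
words = ["", "a", "b", "aa", "ab", "ba", "bb"]
```

The number of disjuncts equals the number of states, and all disjuncts share the transition table, which is consistent with the polynomial bound stated in the proposition.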


Proposition 3. Given a regular constraint $A(y)$ and a parametric transducer
constraint $y = T(x)$, we can compute a regular constraint $A'(x)$ that is equivalent
to $\exists y : A(y) \wedge y = T(x)$. This can be achieved in exponential time.


The construction in Proposition 3 is essentially the same as the pre-image com- 
putation of a symbolic automaton under a symbolic transducer [44]. The com- 
plexity is exponential in the maximum number of output symbols of a single 
transition (i.e. the maximum length of w in the transducer), which is in practice 
a small natural number. 

The following is our main theorem on the SL fragment with equational con- 
straints, regular constraints, and transducers. 


Theorem 4. If $T$ is solvable in PSPACE, then the SL fragment with concatenation
and parametric transducers over $T$ is in 2-EXPTIME and is EXPSPACE-hard.


Proof. We give a decision procedure. We assume that $S$ is already in SSA (i.e.
each variable appears at most once on the left-hand side). Let us assume that $S$
is of the form $S'; y := f(x_1, \ldots, x_r)$, for some symbolic execution $S'$. Without loss
of generality, we may assume that each recognizable constraint is of the form
$A(x)$. This is no limitation: (1) since each $R$ in the assertion is a recognizable
formula, we simply have to “guess” one of the implicants for each $R$, and (2)
$\mathbf{assert}(\psi_1 \wedge \psi_2)$ is equivalent to $\mathbf{assert}(\psi_1); \mathbf{assert}(\psi_2)$.

Assume now that $\{A_1(y), \ldots, A_m(y)\}$ are all the regular constraints on $y$ in
$S$. By our assumption, it is possible to compute a recognizable formula equivalent
to

$$\psi(x_1, \ldots, x_r) := \exists y : \bigwedge_{i=1}^{m} A_i(y) \wedge y = f(x_1, \ldots, x_r).$$

There are two ways to see this. The first way is that regular constraints are closed
under intersection. This is in general computationally quite expensive because
of a product automata construction before applying the pre-image computation.
A better way to do this is to observe that $\psi$ is equivalent to the conjunction of
$\psi_i$'s over $i = 1, \ldots, m$, where

$$\psi_i := \exists y : A_i(y) \wedge y = f(x_1, \ldots, x_r).$$

[Figure: parametric automata $A_0$ (guard $\mathit{curr} \neq k$) and $A_1$ (guards $\top$ and $\mathit{curr} = k$).]

Fig. 1. $A_0$ accepts all words not containing $k$ and $A_1$ accepts all words containing $k$.


By our semantic condition, we can compute recognizable formulas $\psi'_1, \ldots, \psi'_m$
equivalent to $\psi_1, \ldots, \psi_m$ respectively. Therefore, we simply replace $S$ by

$$S'; \mathbf{assert}(\psi'_1); \cdots; \mathbf{assert}(\psi'_m),$$

in which every occurrence of $y$ has been completely eliminated. Applying the
above variable elimination iteratively, we end up with a conjunction of regular
constraints and element constraints, which as we saw in Sect. 4 is decidable.


Example 1. We consider the example from Sect. 2 where a weaker form of the
permutation property is shown for QuickSort. The formula that has to be proven
is a disjunction of straight-line formulas and in the following we execute our
procedure only on one disjunct without redundant formulas:

$$\mathbf{assert}(A_0(\mathit{left}'));\ \mathbf{assert}(A_0(\mathit{right}'));\ \mathit{res} = \mathit{left}' . [l_0] . \mathit{right}';\ \mathbf{assert}(A_1(\mathit{res}))$$

We model $L(A_1)$ as the language which accepts all words which contain
one letter equal to $k$ and $L(A_0)$ as the language which accepts only words not
containing $k$, where $k$ is an uninterpreted constant, so a single element. See
Fig. 1. We begin by removing the operation $\mathit{res} = \mathit{left}' . [l_0] . \mathit{right}'$.
The product automaton for all assertions that contain $\mathit{res}$ is just $A_1$. Hence, we
can remove the assertion $\mathbf{assert}(A_1(\mathit{res}))$. The concatenation function $.$
satisfies RegInvRel and the pre-image $g$ can be represented by

$$\bigvee_{0 \le i, j \le 1} A_1^{[q_0, \{q_i\}]}(\mathit{left}') \wedge A_1^{[q_i, \{q_j\}]}([l_0]) \wedge A_1^{[q_j, F]}(\mathit{right}'),$$

where $A_1^{[p, F']}$ is $A_1$ with start state set to $p$ and finals to $F'$.
In the next step, the assertion $g$ is added to the program and all assertions
containing $\mathit{res}$ and the concatenation function are removed.

$$\mathbf{assert}(A_0(\mathit{left}'));\ \mathbf{assert}(A_0(\mathit{right}'));\ \mathbf{assert}(g(\mathit{left}', [l_0], \mathit{right}'))$$

From here, we pick a tuple from $g$, let's say $i = j = 1$, and obtain

$$\mathbf{assert}(A_0(\mathit{left}'));\ \mathbf{assert}(A_0(\mathit{right}'));\ \mathbf{assert}(\mathit{left}' \in A_1^{[q_0, \{q_1\}]});\ \mathbf{assert}([l_0] \in A_1^{[q_1, \{q_1\}]});\ \mathbf{assert}(\mathit{right}' \in A_1^{[q_1, F]})$$

Finally, the product automata $A_0 \times A_1^{[q_0, \{q_1\}]}$ and $A_0 \times A_1^{[q_1, F]}$ are
computed for the variables $\mathit{left}'$, $\mathit{right}'$ and a non-emptiness check over the
product automata and the automaton for $[l_0]$ is done. The procedure will find no
combination of paths for each automaton which can be satisfied, since $\mathit{left}'$
is forced to accept no words containing $k$ by $A_0$ and only accepts by reading
a $k$ from $A_1^{[q_0, \{q_1\}]}$. Next, the procedure needs to exhaust all tuples
$(A_1^{[q_0, \{q_i\}]}, A_1^{[q_i, \{q_j\}]}, A_1^{[q_j, F]})$ before it is proven that this disjunct
is unsatisfiable.


6 Extensions and Undecidability 


Length Constraints. We consider the extension of our model by allowing
length constraints on the sequence variables: for each sequence variable $x$ we
consider the associated length variable $\ell_x$; let the set of length variables be
$\mathcal{L} = \{\ell_x : x \in \mathcal{V}\}$. We extend $\mu$ to $\mathcal{L}$, where it assigns
natural numbers. The length constraints are of the form $\sum_x a_x \ell_x \mathbin{?} 0$, where
$? \in \{<, \le, =, \ne, \ge, >\}$ and each $a_x$ is an integer constant, i.e., they are linear
arithmetic formulas on the length variables. The semantics is natural: we require that
$|\mu(x)| = \mu(\ell_x)$ (the assigned values are the true lengths of sequences) and that
$\mu(\mathcal{L})$ satisfies each length constraint.
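
As a small illustration of this semantics, the following Python sketch checks one length constraint against a concrete assignment. The function name and the dictionary encoding of coefficients are assumptions made for the example; the lengths are read off the assigned sequences, so $|\mu(x)| = \mu(\ell_x)$ holds by construction.

```python
import operator

# Comparison symbols allowed in length constraints.
OPS = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
       "!=": operator.ne, ">=": operator.ge, ">": operator.gt}

def satisfies_length_constraint(mu, coeffs, op):
    """Check sum_x a_x * |mu(x)| op 0 for a concrete assignment mu."""
    total = sum(a * len(mu[x]) for x, a in coeffs.items())
    return OPS[op](total, 0)

# Hypothetical assignment of two sequence variables.
mu = {"x": [1, 2, 3], "y": [4]}
```

For instance, with coefficients `{"x": 1, "y": -3}` the sum is 0, so the constraint with `=` holds for this `mu` while the one with `<` does not.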

There is, however, another possible extension: if the theory $T$ is
Presburger arithmetic, then the parametric automata could use the values of $\mathcal{L}$.
We first deal with the more generic, though restricted, case when this is not
allowed: then all reductions from Sect. 4 generalize and we can reduce to
word equations with regular and length constraints. However, the decidability
status of this problem is unknown. When we consider Presburger arithmetic and
allow the automata to employ the length variables, then it turns out that we
can interpret the formula (1) as a collection of length constraints, and again we
reduce to word equations with regular and length constraints.


Automata Oblivious of Lengths. We first consider the setting in which the length
variables $\mathcal{L}$ can only be used in length constraints. It is routine to verify that
the reduction from Sect. 4 generalizes to the case of length constraints: it is
possible to first fix $\mu$ for parameters, calling it again $\pi$. Then Lemma 2 shows
that each solution $\mu$ can be mapped by a letter-to-letter homomorphism to a
finite alphabet $D_\pi$, and this mapping preserves the satisfiability/unsatisfiability
of length constraints, so Lemma 2 still holds when length constraints are also
allowed. Similarly, Lemma 3 is not affected by the length constraints, and
finally Lemma 4 deals with regular and equational constraints, ignoring the other
possible constraints, and the lengths of substitutions for variables are the same.
Hence it also holds when length constraints are allowed; the resulting
word equations then use regular and length constraints.

Unfortunately, the decidability of word equations with linear length constraints
(even without regular constraints) is a notorious open problem. Thus
instead of decidability, we get Turing-equivalent problems.

Theorem 5. Deciding regular, equational and length constraints for
T-sequences of a decidable theory T is Turing-equivalent to word equations with
regular and length constraints.


Automata Aware of the Sequence Lengths. We now consider the case when
the underlying theory $T$ is Presburger arithmetic, i.e. the domain $D$ is the natural
numbers and we can use addition, constants 0, 1 and comparisons (and variables).
The additional functionality of the parametric automaton $A$ is that
$\Delta \subseteq_{\mathrm{fin}} Q \times T(\mathit{curr}, \mathcal{X}, \mathcal{L}) \times Q$, i.e. the guards
can also use the length variables; the semantics is extended in the natural way.

Then the type $\mathrm{type}_\pi(a)$ of $a \in \mathbb{N}$ now depends on the values of $\mu$ on
$\mathcal{X}$ and $\mathcal{L}$, hence we denote by $\pi$ the restriction of $\mu$ to
$\mathcal{X} \cup \mathcal{L}$. Then Lemmas 2 and 3 still hold when we fix $\pi$. Similarly,
Lemma 4 holds, but the analogue of (1) now also uses the length variables, which are also
used in the length constraints. Such a formula can be seen as a collection of length
constraints for the original length variables $\mathcal{L}$ as well as length variables
$\mathcal{X} \cup \{a_t : t \in \tau\}$. Hence we validate this formula as
part of the word equations with length constraints. Note that $a_t$ has two roles:
as a letter in $D_\pi$ and as a length variable. However, the connection is encoded
in the formula from the reduction (analogue of (1)) and we can use two different
sets of symbols.


Theorem 6. Deciding conjunction of regular, equational and length constraints 
for sequences of natural numbers with Presburger arithmetic, where the regular 
constraints can use length variables, is Turing-equivalent to word equations with 
regular and (up to exponentially many) length constraints. 


Undecidability of Register Automata Constraints. One could use more
powerful automata for regular constraints; one such popular model is register
automata. Informally, such an automaton has $k$ registers $r_1, \ldots, r_k$ and its
transition depends on the state and the value of a formula using the registers and curr,
the read value [23]. The registers can be updated, to curr or to one of the
registers' values; this is specified in the transition. In “classic” register automata,
guards can only use equality and inequality between registers and curr; in the SRA
model more powerful atoms are allowed. We show that sequence constraints together with
register automata constraints (which use quantifier-free formulas with equality
and inequality as the only atoms, i.e. do not employ the SRA extension) lead to
undecidability (over an infinite domain $D$).


Theorem 7. Satisfiability of equational constraints and register automata
constraints, which use equality and inequality only, over an infinite domain, is
undecidable.


7 Implementations, Optimizations and Benchmarks 


Implementation. We have implemented our decision procedure for problems
in the constraint language SL for the theory of sequences in a new tool SECO
(Sequence Constraint Solver) on top of the SMT solver Princess [41]. We extend a
publicly available library for symbolic automata and transducers [13] to parametric
automata and transducers by connecting them to the uninterpreted constants
in our theory of sequences. Our tool supports symbolic transducers, concatenation
of sequences and reversing of sequences. Any additional function which
satisfies RegInvRel, such as a replace function which replaces only the first and
leftmost longest match, can be added in the future.

Our algorithm is an adaptation of the algorithm of the tool OSTRICH [12] and closely
follows the proof of Theorem 4. To summarize the procedure, a depth-first search is
employed to remove all functions in the given input, splitting on the pre-images
of those functions. When a function is removed, new assertions for the
pre-image constraints are added. After all functions have been removed and only
assertions are left, a non-emptiness check is called over all parametric automata
which encode the assertions. If the check is successful, a corresponding model
can be constructed; otherwise the procedure computes a conflict set and
back-jumps to the last split in the depth-first search.²


Benchmarks. We have performed experiments on two benchmark suites. The
first one concerns the verification of properties of programs manipulating
sequences. The second benchmark suite compares our tool against an algorithm
using symbolic register automata [14] on decision procedures for regular
expressions with back-references, such as emptiness, equivalence and inclusion.

Both benchmark suites require universal quantification over the parameters;
there are existing methods for eliminating these universal quantifiers for certain
classes of PAs. One such class is the semantically deterministic (SD) PAs [22]; despite
its name, being SD is algorithmically checkable. Most of the considered PAs are SD,
in particular all in benchmark suite 2.

Experiments were conducted on an AMD Ryzen 5 1600 Six-Core CPU with
16 GB of RAM running Windows 10. The results for the second benchmark suite
are shown in Table 1. The timeout for all benchmarks is 300 s.

In the first benchmark suite we verify a weaker form of the
permutation property of sorting as shown in Sect. 2. Furthermore, we verify
properties of two self-stabilizing algorithms for mutual exclusion on parameterized
systems. The first one is Lamport’s bakery algorithm [33], for which we
proved that the algorithm ensures mutual exclusion. The system is modelled in
the style of regular model checking [8], with system states represented as words,
here over an infinite alphabet: the character representing a thread stores the
thread control state, a Boolean flag, and an integer as the number drawn by
the thread. The system transitions are modelled as parametric transducers, and
invariants as parametric automata. The second algorithm is known as Dijkstra’s
Self-Stabilizing Protocol [20], in which system states are encoded as sequences
of integers, and in which we verify that the set of states in which exactly one
processor is privileged forms an invariant. The mentioned benchmarks require
universal quantification, but similar to the motivating example from Sect. 2 one
can eliminate quantifiers by Skolemization and instantiation, which was done by
hand.

² For a more detailed write-up of the depth-first search algorithm see OSTRICH [12],
Algorithm 1.

Table 1. Benchmark suite 2. SRA is used for the algorithm for symbolic register
automata and SECO for our tool. The symbol ∅ indicates the column where emptiness
was checked, ≡ indicates self-equivalence and ⊆ inclusion of languages. A dash (-)
marks a timeout.

L1     | L2     | SRA_∅(L1) | SECO_∅(L1) | SRA_≡(L1) | SECO_≡(L1) | SRA_⊆(L2,L1) | SECO_⊆(L2,L1)
Pr-C2  | Pr-CL2 | 0.03s     | 0.65s      | 0.43s     | 0.10s      | 4.78s        | 0.10s
Pr-C3  | Pr-CL3 | 0.58s     | 0.70s      | 10.73s    | 0.12s      | 36.90s       | 0.10s
Pr-C4  | Pr-CL4 | 18.40s    | 0.77s      | 98.38s    | 0.14s      | -            | 0.10s
Pr-C6  | Pr-CL6 | -         | 1.00s      | -         | 0.12s      | -            | 0.10s
Pr-CL2 | Pr-C2  | 0.33s     | 0.30s      | 1.03s     | 0.13s      | 0.52s        | 0.76s
Pr-CL3 | Pr-C3  | 14.04s    | 0.38s      | 20.44s    | 0.13s      | 10.52s       | 0.76s
Pr-CL4 | Pr-C4  | -         | 0.41s      | 0.43s     | 0.12s      | -            | 0.82s
Pr-CL6 | Pr-C6  | -         | 0.62s      | 0.43s     | 0.12s      | -            | 1.27s
IP-2   | IP-3   | 0.11s     | 1.53s      | 0.63s     | 0.14s      | 2.43s        | 0.15s
IP-3   | IP-4   | 1.83s     | 1.45s      | 4.66s     | 0.14s      | 28.60s       | 0.17s
IP-4   | IP-6   | 30.33s    | 1.75s      | 80.03s    | 0.14s      | -            | 0.17s
IP-6   | IP-9   | -         | 1.60s      | 0.43s     | 0.13s      | -            | 0.17s

The second benchmark suite consists of three different types of benchmarks,
summarized in Table 1. The benchmark PR-Cn describes a regular expression
for matching products which have the same code number of length n, and
PR-CLn matches not only the code number but also the lot number. The last type
of benchmark is IP-n, which matches n positions of 2 IP addresses. The benchmarks
are taken from the regular-expression crowd-sourcing website RegExLib
[39] and are also used in the experiments for symbolic register automata [14],
against which we also compare our results. To apply our decision procedure to the
benchmarks, we encode each of the benchmarks as a parametric automaton,
using parameters for the (bounded-size) back-references. The task in the
experiments is to check emptiness, language equivalence, and language inclusion for
the same combinations of the benchmarks as considered in [14].


Results of the Experiments. All properties of the first benchmark suite can be
encoded by parametric automata with very few states and parameters. As a result,
the properties for each program can be verified in under 2.6 s: the property for
Dijkstra’s algorithm was proven in 0.6 s, QuickSort in 1.1 s and Lamport’s bakery
algorithm in 2.5 s.

The results for the second benchmark suite are shown in Table 1. The algorithm
for symbolic register automata times out on 11 of the 36 benchmarks, while
our tool solves most benchmarks in under 1 s. One thing to observe is that the
algorithm for symbolic register automata scales poorly when more registers are needed
to capture the back-references, while the performance of our approach does not change
noticeably when more parameters are introduced.


8 Conclusion and Future Work


In this paper, we have performed a systematic investigation of decidability and
complexity of constraints on sequences. Our starting point is the subcase of
string constraints (i.e. over a finite set of sequence elements), which include
equational constraints with concatenation, regular constraints, length constraints,
and transducers. We have identified parametric automata (extending symbolic
automata and variable automata) as a suitable notion of “regular constraints” over
sequences, and parametric transducers (extending symbolic transducers) as a
suitable notion of transducers over sequences. We showed that decidability results in
the case of strings carry over to sequences, although the complexity is in general
higher than in the case of strings (sometimes exponentially higher). For certain
element theories (e.g. Linear Real Arithmetic), it is possible to retain the same
complexity as in the string case. We also delineate the boundary of the suitable
notion of “regular constraints” by showing that equational constraints with
symbolic register automata [14] yield undecidable satisfiability. Finally, our new
sequence solver SECO shows promising experimental results.

There are several future research avenues. Firstly, the complexity of sequence 
constraints over other specific element theories (e.g. Linear Integer Arithmetic) 
should be precisely determined. Secondly, is it possible to recover decidability 
with other fragments of register automata (e.g., single-use automata [7])? On
the implementation side, there are some algorithmic improvements, e.g., better 
nonemptiness checks for parametric automata in the case of a single automaton, 
as well as product of multiple automata. 


Acknowledgment. We thank anonymous reviewers for their thorough and helpful
feedback. We are grateful to Nikolaj Bjørner, Rupak Majumdar and Margus Veanes
for the inspiring discussion.


Abstract. We formulate, in lattice-theoretic terms, two novel algorithms
inspired by Bradley’s property directed reachability algorithm.
For finding safe invariants or counterexamples, the first algorithm 
exploits over-approximations of both forward and backward transition 
relations, expressed abstractly by the notion of adjoints. In the absence 
of adjoints, one can use the second algorithm, which exploits lower sets 
and their principals. As a notable example of application, we consider 
quantitative reachability problems for Markov Decision Processes. 


Keywords: PDR · Lattice theory · Adjoints · MDPs · Over-approximation


1 Introduction 


Property directed reachability analysis (PDR) refers to a class of verification 
algorithms for solving safety problems of transition systems [5,12]. Its essence 
consists of 1) interleaving the construction of an inductive invariant (a positive 
chain) with that of a counterexample (a negative sequence), and 2) making the 
two sequences interact, with one narrowing down the search space for the other. 
PDR algorithms have shown impressive performance both in hardware and 
software verification, leading to active research [15,18,28,29] going far beyond 
its original scope. For instance, an abstract domain [8] capturing the over- 
approximation exploited by PDR has been recently introduced in [13], while 
PrIC3 [3] extended PDR for quantitative verification of probabilistic systems.

To uncover the abstract principles behind PDR and its extensions, Kori et
al. proposed LT-PDR [19], a generalisation of PDR in terms of lattice/category 
theory. LT-PDR can be instantiated using domain-specific heuristics to create 
effective algorithms for different kinds of systems such as Kripke structures, 
Markov Decision Processes (MDPs), and Markov reward models. However, the 
theory in [19] does not offer guidance on devising concrete heuristics. 


Adjoints in PDR. Our approach shares the same vision of LT-PDR, but we 
identify different principles: adjunctions are the core of our toolset. 


An adjunction $f \dashv g$ between $A$ and $C$ is one of the central concepts in category
theory [23]. It is prevalent in various fields of computer science,
too, such as abstract interpretation [8] and functional programming [22].
Our use of adjoints in this work comes in the following two flavours.


— (forward-backward adjoint) f describes the forward semantics of a transition 
system, while g is the backward one, where we typically have A = C. 

— (abstraction-concretization adjoint) C is a concrete semantic domain, and A 
is an abstract one, much like in abstract interpretation. An adjoint enables 
us to convert a fixed-point problem in C to that in A. 


Our Algorithms. The problem we address is the standard lattice theoretical
formulation of safety problems, namely whether the least fixed point of a
continuous map $b$ over a complete lattice $(L, \sqsubseteq)$ is below a given
element $p \in L$. In symbols, $\mu b \sqsubseteq^{?} p$. We present two
algorithms.

The first one, named AdjointPDR, assumes an element $i \in L$ and two adjoints
$f \dashv g \colon L \to L$, representing respectively initial states, forward
semantics and backward semantics, such that $b(x) = f(x) \sqcup i$ for all
$x \in L$. Under this assumption, we have the following equivalences (they
follow from the Knaster-Tarski theorem, see §2):

$$\mu b \sqsubseteq p \iff \mu(f \sqcup i) \sqsubseteq p \iff i \sqsubseteq \nu(g \sqcap p),$$

where $\mu(f \sqcup i)$ and $\nu(g \sqcap p)$ are, by the Kleene theorem, the
limits of the initial and final chains illustrated below.

$$\bot \sqsubseteq i \sqsubseteq f(i) \sqcup i \sqsubseteq \cdots \qquad\qquad \cdots \sqsubseteq g(p) \sqcap p \sqsubseteq p \sqsubseteq \top$$


As positive chain, PDR exploits an over-approximation of the initial chain: it is 
made greater to accelerate convergence; still it has to be below p. 

The distinguishing feature of AdjointPDR is to take as a negative sequence 
(that is a sequential construction of potential counterexamples) an over- 
approximation of the final chain. This crucially differs from the negative sequence 
of LT-PDR, namely an under-approximation of the computed positive chain. 

We prove that AdjointPDR is sound (Theorem 5) and does not loop
(Proposition 7) but, since the problem $\mu b \sqsubseteq^{?} p$ is not always
decidable, we cannot prove termination. Nevertheless, AdjointPDR allows for a
formal theory of heuristics that are essential when instantiating the
algorithm to concrete problems. The theory prescribes the choices to obtain
the boundary executions, using initial and final chains (Proposition 10); it
thus identifies a class of heuristics guaranteeing termination when answers
are negative (Theorem 12).

AdjointPDR's assumption of a forward-backward adjoint $f \dashv g$, however,
does not hold very often, especially in probabilistic settings. Our second
algorithm AdjointPDR$^{\downarrow}$ circumvents this problem by extending the
lattice for the negative sequence, from $L$ to the lattice $L^{\downarrow}$ of
lower sets in $L$.

Specifically, by using the second form of adjoints, namely an
abstraction-concretization pair, the problem $\mu b \sqsubseteq^{?} p$ in $L$
can be translated to an equivalent problem on $b^{\downarrow}$ in
$L^{\downarrow}$, for which an adjoint
$b^{\downarrow} \dashv b^{\downarrow}_{r}$ is guaranteed. This allows one to
run AdjointPDR in the lattice $L^{\downarrow}$. We then notice that the search
for a positive chain can be conveniently restricted to principals in
$L^{\downarrow}$, which have representatives in $L$. The resulting algorithm,
using $L$ for positive chains and $L^{\downarrow}$ for negative sequences, is
AdjointPDR$^{\downarrow}$.

The use of lower sets for the negative sequence is a key advantage. It not
only avoids the restrictive assumption on forward-backward adjoints
$f \dashv g$, but also enables a more thorough search for counterexamples.
AdjointPDR$^{\downarrow}$ can simulate step-by-step LT-PDR (Theorem 17), while
the reverse is not possible due to a single negative sequence in
AdjointPDR$^{\downarrow}$ potentially representing multiple (Proposition 18)
or even all (Proposition 19) negative sequences in LT-PDR.


Concrete Instances. Our lattice-theoretic algorithms yield many concrete
instances: the original IC3/PDR [5,12] as well as Reverse PDR [27] are instances
of AdjointPDR with $L$ being the powerset of the state space; since LT-PDR can
be simulated by AdjointPDR$^{\downarrow}$, the latter generalizes all
instances in [19].

As a notable instance, we apply AdjointPDR$^{\downarrow}$ to MDPs,
specifically to decide if the maximum reachability probability [1] is below a
given threshold. Here the lattice $L = [0,1]^{S}$ is that of fuzzy predicates
over the state space $S$. Our theory provides guidance to devise two
heuristics, for which we prove negative termination (Corollary 20). We present
its implementation in Haskell, and its experimental evaluation, where
comparison is made against existing probabilistic PDR algorithms (PrIC3 [3],
LT-PDR [19]) and a non-PDR one (Storm [11]). The performance of
AdjointPDR$^{\downarrow}$ is encouraging: it supports the potential of PDR
algorithms in probabilistic model checking. The experiments also indicate the
importance of having a variety of heuristics, and thus the value of our adjoint
framework, which helps in devising them.

Additionally, we found that the abstraction features of Haskell allow us to
code lattice-theoretic algorithms almost literally (~100 lines). Implementing
a few heuristics takes another ~240 lines. In this way, mathematical
abstraction directly eases the implementation effort.


Related Work. Reverse PDR [27] applies PDR from unsafe states using a
backward transition relation $T$ and tries to prove that initial states are
unreachable. Our right adjoint $g$ is also backward, but it differs from $T$
in the presence of nondeterminism: roughly, $T(X)$ is the set of states which
can reach $X$ in one step, while $g(X)$ are states which only reach $X$ in one
step. fbPDR [28,29] runs PDR and Reverse PDR in parallel with shared
information. Our work uses both forward and backward directions (the pair
$f \dashv g$), too, but approximates differently: Reverse PDR
over-approximates the set of states that can reach an unsafe state, while we
over-approximate the set of states that only reach safe states.

The comparison with LT-PDR [19] is extensively discussed in Sect. 4.2.
PrIC3 [3] extended PDR to MDPs, which are our main experimental ground:
Sect. 6 compares the performances of PrIC3, LT-PDR and
AdjointPDR$^{\downarrow}$.

We remark that PDR has been applied to other settings, such as software
model checking using theories and SMT-solvers [6,21] or automated
planning [30]. Most of them (e.g., software model checking) already fall
within the generality of LT-PDR and thus they can be embedded in our
framework.

It is also worth mentioning that, in the context of abstract interpretation,
the use of adjoints to construct initial and final chains and exploit the
interaction between their approximations has been investigated in several
works, e.g., [7].


Structure of the Paper. After recalling some preliminaries in Sect. 2, we
present AdjointPDR in Sect. 3 and AdjointPDR$^{\downarrow}$ in Sect. 4. In
Sect. 5 we introduce the heuristics for the max reachability problems of MDPs,
which are experimentally tested in Sect. 6.


2 Preliminaries and Notation 


We assume that the reader is familiar with lattice theory, see, e.g., [10]. We
use $(L, \sqsubseteq)$, $(L_1, \sqsubseteq_1)$, $(L_2, \sqsubseteq_2)$ to range
over complete lattices and $x, y, z$ to range over their elements. We omit
subscripts and order relations whenever clear from the context. As usual,
$\bigsqcup$ and $\bigsqcap$ denote least upper bound and greatest lower bound,
$\sqcup$ and $\sqcap$ denote join and meet, $\top$ and $\bot$ top and bottom.
Hereafter we will tacitly assume that all maps are monotone. Obviously, the
identity map $id \colon L \to L$ and the composition
$f \circ g \colon L_1 \to L_3$ of two monotone maps $g \colon L_1 \to L_2$ and
$f \colon L_2 \to L_3$ are monotone. For a map $f \colon L \to L$, we
inductively define $f^{0} = id$ and $f^{n+1} = f \circ f^{n}$. Given
$l \colon L_1 \to L_2$ and $r \colon L_2 \to L_1$, we say that $l$ is the left
adjoint of $r$, or equivalently that $r$ is the right adjoint of $l$, written
$l \dashv r$, when it holds that $l(x) \sqsubseteq_2 y$ iff
$x \sqsubseteq_1 r(y)$ for all $x \in L_1$ and $y \in L_2$. Given a map
$f \colon L \to L$, the element $x \in L$ is a post-fixed point iff
$x \sqsubseteq f(x)$, a pre-fixed point iff $f(x) \sqsubseteq x$ and a fixed
point iff $x = f(x)$. Pre, post and fixed points form complete lattices: we
write $\mu f$ and $\nu f$ for the least and greatest fixed point.
Several problems relevant to computer science can be reduced to checking if
$\mu b \sqsubseteq p$ for a monotone map $b \colon L \to L$ on a complete
lattice $L$. The Knaster-Tarski fixed-point theorem characterises $\mu b$ as
the greatest lower bound of all pre-fixed points of $b$ and $\nu b$ as the
least upper bound of all its post-fixed points:

$$\mu b = \bigsqcap \{ x \mid b(x) \sqsubseteq x \} \qquad\qquad \nu b = \bigsqcup \{ x \mid x \sqsubseteq b(x) \}.$$

This immediately leads to two proof principles, illustrated below:

$$\frac{\exists x,\ b(x) \sqsubseteq x \sqsubseteq p}{\mu b \sqsubseteq p} \qquad\qquad \frac{\exists x,\ i \sqsubseteq x \sqsubseteq b(x)}{i \sqsubseteq \nu b} \qquad \text{(KT)}$$


Exploiting Adjoints in PDR 45 


Fig. 1. The transition system of Example 1, with $S = \{s_0, \dots, s_6\}$ and $I = \{s_0\}$.


By means of (KT), one can prove $\mu b \sqsubseteq p$ by finding some
pre-fixed point $x$, often called invariant, such that $x \sqsubseteq p$.
However, automatically finding invariants might be rather complicated, so most
of the algorithms rely on another fixed-point theorem, usually attributed to
Kleene. It characterises $\mu b$ and $\nu b$ as the least upper bound and the
greatest lower bound of the initial and final chains:

$$\bot \sqsubseteq b(\bot) \sqsubseteq b^{2}(\bot) \sqsubseteq \cdots \qquad \text{and} \qquad \cdots \sqsubseteq b^{2}(\top) \sqsubseteq b(\top) \sqsubseteq \top.$$

That is,

$$\mu b = \bigsqcup_{n \in \mathbb{N}} b^{n}(\bot), \qquad\qquad \nu b = \bigsqcap_{n \in \mathbb{N}} b^{n}(\top). \qquad \text{(Kl)}$$

The assumptions are stronger than for Knaster-Tarski: for the leftmost
statement, it requires the map $b$ to be $\omega$-continuous (i.e., it
preserves $\bigsqcup$ of $\omega$-chains) and, for the rightmost,
$\omega$-co-continuous (similar but for $\bigsqcap$). Observe that every left
adjoint is continuous and every right adjoint is co-continuous (see
e.g. [23]).

As explained in [19], property directed reachability (PDR) algorithms [5]
exploit (KT) to try to prove the inequation and (Kl) to refute it. In the
algorithm we introduce in the next section, we further assume that $b$ is of
the form $f \sqcup i$ for some element $i \in L$ and map $f \colon L \to L$,
namely $b(x) = f(x) \sqcup i$ for all $x \in L$. Moreover we require $f$ to
have a right adjoint $g \colon L \to L$. In this case

$$\mu(f \sqcup i) \sqsubseteq p \quad \text{iff} \quad i \sqsubseteq \nu(g \sqcap p) \qquad (1)$$

(which is easily shown using the Knaster-Tarski theorem) and $(f \sqcup i)$
and $(g \sqcap p)$ are guaranteed to be (co)continuous. Since $f \dashv g$ and
left and right adjoints preserve, resp., arbitrary joins and meets, then for
all $n \in \mathbb{N}$

$$(f \sqcup i)^{n}(\bot) = \bigsqcup_{j < n} f^{j}(i) \qquad\qquad (g \sqcap p)^{n}(\top) = \bigsqcap_{j < n} g^{j}(p) \qquad (2)$$

which by (Kl) provide useful characterisations of least and greatest fixed
points.

$$\mu(f \sqcup i) = \bigsqcup_{n \in \mathbb{N}} f^{n}(i) \qquad\qquad \nu(g \sqcap p) = \bigsqcap_{n \in \mathbb{N}} g^{n}(p) \qquad \text{(Kl}\dashv\text{)}$$
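
For readers who want the argument behind (1) spelled out, one possible derivation is sketched below. It uses only the two (KT) proof principles and the adjunction between f and g; the names x and y are local to this sketch and are not taken from the paper.

```latex
\begin{align*}
(\Rightarrow)\quad
  & \text{Let } x = \mu(f \sqcup i). \text{ Then } f(x) \sqcup i = x \sqsubseteq p,
    \text{ hence } i \sqsubseteq x \text{ and } f(x) \sqsubseteq x. \\
  & \text{By } f \dashv g,\ f(x) \sqsubseteq x \text{ gives } x \sqsubseteq g(x);
    \text{ together with } x \sqsubseteq p:\ x \sqsubseteq g(x) \sqcap p. \\
  & \text{So } i \sqsubseteq x \sqsubseteq (g \sqcap p)(x),
    \text{ and (KT) yields } i \sqsubseteq \nu(g \sqcap p). \\[4pt]
(\Leftarrow)\quad
  & \text{Let } y = \nu(g \sqcap p). \text{ Then } y = g(y) \sqcap p,
    \text{ hence } y \sqsubseteq g(y) \text{ and } y \sqsubseteq p. \\
  & \text{By } f \dashv g,\ y \sqsubseteq g(y) \text{ gives } f(y) \sqsubseteq y;
    \text{ together with } i \sqsubseteq y:\ f(y) \sqcup i \sqsubseteq y. \\
  & \text{So } (f \sqcup i)(y) \sqsubseteq y \sqsubseteq p,
    \text{ and (KT) yields } \mu(f \sqcup i) \sqsubseteq p.
\end{align*}
```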


We conclude this section with an example that we will often revisit. It also 
provides a justification for the intuitive terminology that we sporadically use. 


Example 1 (Safety problem for transition systems). A transition system consists
of a triple $(S, I, \delta)$ where $S$ is a set of states, $I \subseteq S$ is a
set of initial states, and $\delta \colon S \to \mathcal{P}S$ is a transition
relation. Here $\mathcal{P}S$ denotes the powerset of $S$, which
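
(I0) $x_0 = \bot$
(I1) $1 \le k \le n$
(I2) $\forall j \in [0, n-2],\ x_j \sqsubseteq x_{j+1}$

(P1) $i \sqsubseteq x_1$
(P2) $x_{n-2} \sqsubseteq p$
(P3) $\forall j \in [0, n-2],\ f(x_j) \sqsubseteq x_{j+1}$
(P3a) $\forall j \in [0, n-2],\ x_j \sqsubseteq g(x_{j+1})$

(N1) If $\boldsymbol{y} \neq \varepsilon$ then $p \sqsubseteq y_{n-1}$
(N2) $\forall j \in [k, n-2],\ g(y_{j+1}) \sqsubseteq y_j$

(PN) $\forall j \in [k, n-1],\ x_j \not\sqsubseteq y_j$

(A1) $\forall j \in [0, n-1],\ (f \sqcup i)^{j}(\bot) \sqsubseteq x_j \sqsubseteq (g \sqcap p)^{n-1-j}(\top)$
(A2) $\forall j \in [1, n-1],\ x_{j-1} \sqsubseteq g^{n-1-j}(p)$
(A3) $\forall j \in [k, n-1],\ g^{n-1-j}(p) \sqsubseteq y_j$

Fig. 2. Invariants of AdjointPDR.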


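forms a complete lattice ordered by inclusion $\subseteq$. By defining
$F \colon \mathcal{P}S \to \mathcal{P}S$ as
$F(X) \stackrel{\text{def}}{=} \bigcup_{s \in X} \delta(s)$ for all
$X \in \mathcal{P}S$, one has that $\mu(F \cup I)$ is the set of all states
reachable from $I$. Therefore, for any $P \in \mathcal{P}S$, representing some
safety property, $\mu(F \cup I) \subseteq P$ holds iff all reachable states
are safe. It is worth remarking that $F$ has a right adjoint
$G \colon \mathcal{P}S \to \mathcal{P}S$ defined for all $X \in \mathcal{P}S$
as $G(X) \stackrel{\text{def}}{=} \{ s \mid \delta(s) \subseteq X \}$. Thus by
(1), $\mu(F \cup I) \subseteq P$ iff $I \subseteq \nu(G \cap P)$.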
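
The adjunction between F and G can be checked by brute force on a small example. The Python sketch below is only an illustration: the four-state relation `DELTA` and all helper names are assumptions made here and do not describe the system of Fig. 1.

```python
from itertools import combinations

# Hypothetical transition relation delta: state -> set of successor states.
DELTA = {0: {1, 2}, 1: {3}, 2: {3}, 3: {3}}
STATES = frozenset(DELTA)


def powerset(states):
    """All subsets of `states`, as frozensets."""
    items = sorted(states)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def F(X):
    """Forward image: the union of delta(s) for s in X."""
    return frozenset(t for s in X for t in DELTA[s])


def G(Y):
    """Candidate right adjoint: states whose successors all lie in Y."""
    return frozenset(s for s in STATES if DELTA[s] <= Y)


def is_adjunction():
    """Check F(X) <= Y  iff  X <= G(Y) for every pair of subsets."""
    subsets = powerset(STATES)
    return all((F(X) <= Y) == (X <= G(Y)) for X in subsets for Y in subsets)
```

Calling `is_adjunction()` enumerates all 16 x 16 pairs of subsets, which is only feasible for tiny state spaces; it is meant as a sanity check of the definition, not as a verification technique.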


Consider the transition system in Fig. 1. Hereafter we write $S_j$ for the set
of states $\{s_0, s_1, \dots, s_j\}$ and we fix the set of safe states to be
$P = S_5$. It is immediate to see that $\mu(F \cup I) = S_4 \subseteq P$.
Automatically, this can be checked with the initial chain of $(F \cup I)$ or
with the final chain of $(G \cap P)$ displayed below on the left and on the
right, respectively.

$$\emptyset \subseteq I \subseteq S_2 \subseteq S_3 \subseteq S_4 \subseteq S_4 \subseteq \cdots \qquad\qquad \cdots \subseteq S_4 \subseteq S_4 \subseteq P \subseteq S$$

The $(j+1)$-th element of the initial chain contains all the states that can be
reached from $I$ in at most $j$ transitions, while the $(j+1)$-th element of
the final chain contains all the states that in at most $j$ transitions reach
safe states only.
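
The two chains can be reproduced with a few lines of Python. Since the figure itself is not available here, the relation `DELTA` below is an assumed reconstruction, chosen only so that the computed chains agree with the ones quoted in the text; states s_0, ..., s_6 are written as the integers 0, ..., 6.

```python
# Assumed transition relation (a guess consistent with the chains above).
DELTA = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}, 4: {4}, 5: {6}, 6: {6}}
S = frozenset(DELTA)
I = frozenset({0})
P = frozenset(range(6))          # the safe states P = S_5


def F(X):
    """Forward image of X under DELTA."""
    return frozenset(t for s in X for t in DELTA[s])


def G(X):
    """States whose successors all lie in X."""
    return frozenset(s for s in S if DELTA[s] <= X)


def chain(step, start):
    """Iterate `step` from `start` until the value stops changing."""
    elems = [start]
    while step(elems[-1]) != elems[-1]:
        elems.append(step(elems[-1]))
    return elems


# Initial chain of (F ∪ I) from the bottom, final chain of (G ∩ P) from the top.
initial_chain = chain(lambda X: F(X) | I, frozenset())
final_chain = chain(lambda X: G(X) & P, S)
```

On a finite powerset both iterations stabilise, so the last elements of the two lists are the least fixed point of F ∪ I and the greatest fixed point of G ∩ P for this assumed system.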


3 Adjoint PDR 


In this section we present AdjointPDR, an algorithm that takes in input a tuple
$(i, f, g, p)$ with $i, p \in L$ and $f \dashv g \colon L \to L$ and, if it
terminates, it returns true whenever $\mu(f \sqcup i) \sqsubseteq p$ and false
otherwise.

The algorithm manipulates two sequences of elements of $L$:
$\boldsymbol{x} \stackrel{\text{def}}{=} x_0, \dots, x_{n-1}$ of length $n$ and
$\boldsymbol{y} \stackrel{\text{def}}{=} y_k, \dots, y_{n-1}$ of length
$n - k$. These satisfy, through the executions of AdjointPDR, the invariants in
Fig. 2. Observe that, by (A1), $x_j$ over-approximates the $j$-th element of
the initial chain, namely $(f \sqcup i)^{j}(\bot) \sqsubseteq x_j$, while, by
(A3), the $j$-indexed element $y_j$ of $\boldsymbol{y}$ over-approximates
$g^{n-j-1}(p)$ that, borrowing the terminology of Example 1, is the set of
states which are safe in $n - j - 1$ transitions. Moreover, by (PN), the
element $y_j$ witnesses that $x_j$ is unsafe, i.e., that
$x_j \not\sqsubseteq g^{n-1-j}(p)$ or equivalently
$f^{n-j-1}(x_j) \not\sqsubseteq p$. Notably, $\boldsymbol{x}$ is a positive
chain and $\boldsymbol{y}$ a negative sequence, according to the definitions
below.
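
AdjointPDR $(i, f, g, p)$

<INITIALISATION>
  $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} := (\bot, \top \| \varepsilon)_{2,2}$
<ITERATION>                                  % x, y not conclusive
  case $(\boldsymbol{x} \| \boldsymbol{y})_{n,k}$ of
    $\boldsymbol{y} = \varepsilon$ and $x_{n-1} \sqsubseteq p$ :                 % (Unfold)
      $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} := (\boldsymbol{x}, \top \| \varepsilon)_{n+1,n+1}$
    $\boldsymbol{y} = \varepsilon$ and $x_{n-1} \not\sqsubseteq p$ :             % (Candidate)
      choose $z \in L$ such that $x_{n-1} \not\sqsubseteq z$ and $p \sqsubseteq z$;
      $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} := (\boldsymbol{x} \| z)_{n,n-1}$
    $\boldsymbol{y} \neq \varepsilon$ and $f(x_{k-1}) \not\sqsubseteq y_k$ :     % (Decide)
      choose $z \in L$ such that $x_{k-1} \not\sqsubseteq z$ and $g(y_k) \sqsubseteq z$;
      $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} := (\boldsymbol{x} \| z, \boldsymbol{y})_{n,k-1}$
    $\boldsymbol{y} \neq \varepsilon$ and $f(x_{k-1}) \sqsubseteq y_k$ :         % (Conflict)
      choose $z \in L$ such that $z \sqsubseteq y_k$ and $(f \sqcup i)(x_{k-1} \sqcap z) \sqsubseteq z$;
      $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} := (\boldsymbol{x} \sqcap_k z \| \mathrm{tail}(\boldsymbol{y}))_{n,k+1}$
  endcase
<TERMINATION>
  if $\exists j \in [0, n-2].\ x_{j+1} \sqsubseteq x_j$ then return true     % x conclusive
  if $i \not\sqsubseteq y_1$ then return false                               % y conclusive


Fig. 3. AdjointPDR algorithm checking $\mu(f \sqcup i) \sqsubseteq p$.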


Definition 2 (positive chain). A positive chain for
$\mu(f \sqcup i) \sqsubseteq p$ is a finite chain
$x_0 \sqsubseteq \cdots \sqsubseteq x_{n-1}$ in $L$ of length $n \ge 2$ which
satisfies (P1), (P2), (P3) in Fig. 2. It is conclusive if
$x_{j+1} \sqsubseteq x_j$ for some $j \le n - 2$.


In a conclusive positive chain, $x_{j+1}$ provides an invariant for
$f \sqcup i$ and thus, by (KT), $\mu(f \sqcup i) \sqsubseteq p$ holds. So, when
$\boldsymbol{x}$ is conclusive, AdjointPDR returns true.


Definition 3 (negative sequence). A negative sequence for
$\mu(f \sqcup i) \sqsubseteq p$ is a finite sequence $y_k, \dots, y_{n-1}$ in
$L$ with $1 \le k \le n$ which satisfies (N1) and (N2) in Fig. 2. It is
conclusive if $k = 1$ and $i \not\sqsubseteq y_1$.


When $\boldsymbol{y}$ is conclusive, AdjointPDR returns false as $y_1$ provides
a counterexample: (N1) and (N2) entail (A3) and thus
$i \not\sqsubseteq y_1 \sqsupseteq g^{n-2}(p)$. By (Kl$\dashv$),
$g^{n-2}(p) \sqsupseteq \nu(g \sqcap p)$ and thus
$i \not\sqsubseteq \nu(g \sqcap p)$. By (1),
$\mu(f \sqcup i) \not\sqsubseteq p$.

The pseudocode of the algorithm is displayed in Fig. 3, where we write
$(\boldsymbol{x} \| \boldsymbol{y})_{n,k}$ to compactly represent the state of
the algorithm: the pair $(n, k)$ is called the index of the state, with
$\boldsymbol{x}$ of length $n$ and $\boldsymbol{y}$ of length $n - k$. When
$k = n$, $\boldsymbol{y}$ is the empty sequence $\varepsilon$. For any
$z \in L$, we write $\boldsymbol{x}, z$ for the chain $x_0, \dots, x_{n-1}, z$
of length $n + 1$ and $z, \boldsymbol{y}$ for the sequence
$z, y_k, \dots, y_{n-1}$ of length $n - (k - 1)$. Moreover, we write
$\boldsymbol{x} \sqcap_j z$ for the chain
$x_0 \sqcap z, \dots, x_j \sqcap z, x_{j+1}, \dots, x_{n-1}$. Finally,
$\mathrm{tail}(\boldsymbol{y})$ stands for the tail of $\boldsymbol{y}$, namely
$y_{k+1}, \dots, y_{n-1}$ of length $n - (k + 1)$.
The algorithm starts in the initial state
$s_0 \stackrel{\text{def}}{=} (\bot, \top \| \varepsilon)_{2,2}$ and, unless
one of $\boldsymbol{x}$ and $\boldsymbol{y}$ is conclusive, iteratively applies
one of the four mutually exclusive rules: (Unfold), (Candidate), (Decide) and
(Conflict). The rule (Unfold) extends the positive chain by one element when
the negative sequence is empty and the positive chain is under $p$; since the
element introduced by (Unfold) is $\top$, its application typically triggers
rule (Candidate) that starts the negative sequence with an over-approximation
of $p$. Recall that the role of $y_j$ is to witness that $x_j$ is unsafe. After
(Candidate) either (Decide) or (Conflict) are possible: if $y_k$ witnesses
that, besides $x_k$, also $f(x_{k-1})$ is unsafe, then (Decide) is used to
further extend the negative sequence to witness that $x_{k-1}$ is unsafe;
otherwise, the rule (Conflict) improves the precision of the positive chain in
such a way that $y_k$ no longer witnesses $x_k \sqcap z$ unsafe and, thus, the
negative sequence is shortened.
Note that, in (Candidate), (Decide) and (Conflict), the element $z \in L$ is
chosen among a set of possibilities, thus AdjointPDR is nondeterministic.


To illustrate the executions of the algorithm, we adopt a labeled transition
system notation. Let
$\mathcal{S} \stackrel{\text{def}}{=} \{ (\boldsymbol{x} \| \boldsymbol{y})_{n,k} \mid n \ge 2,\ k \le n,\ \boldsymbol{x} \in L^{n} \text{ and } \boldsymbol{y} \in L^{n-k} \}$
be the set of all possible states of AdjointPDR. We call
$(\boldsymbol{x} \| \boldsymbol{y})_{n,k} \in \mathcal{S}$ conclusive if
$\boldsymbol{x}$ or $\boldsymbol{y}$ are such. When $s \in \mathcal{S}$ is not
conclusive, we write $s \xrightarrow{D}$ to mean that $s$ satisfies the guards
in the rule (Decide), and $s \xrightarrow{D}_{z} s'$ to mean that, being
(Decide) applicable, AdjointPDR moves from state $s$ to $s'$ by choosing $z$.
Similarly for the other rules: the labels $Ca$, $Co$ and $U$ stand for
(Candidate), (Conflict) and (Unfold), respectively. When irrelevant we omit
labels and choices and we just write $s \to s'$. As usual $\to^{+}$ stands for
the transitive closure of $\to$ while $\to^{*}$ stands for the reflexive and
transitive closure of $\to$.


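Example 4. Consider the safety problem in Example 1. Below we illustrate two
possible computations of AdjointPDR that differ for the choice of $z$ in
(Conflict). The first run is conveniently represented as the following series
of transitions.

$$
\begin{aligned}
(\emptyset, S \| \varepsilon)_{2,2}
 &\xrightarrow{Ca}_{P} (\emptyset, S \| P)_{2,1}
  \xrightarrow{Co}_{I} (\emptyset, I \| \varepsilon)_{2,2}
  \xrightarrow{U} (\emptyset, I, S \| \varepsilon)_{3,3}
  \xrightarrow{Ca}_{P} (\emptyset, I, S \| P)_{3,2} \\
 &\xrightarrow{Co}_{S_2} (\emptyset, I, S_2 \| \varepsilon)_{3,3}
  \xrightarrow{U}\xrightarrow{Ca}_{P} (\emptyset, I, S_2, S \| P)_{4,3}
  \xrightarrow{Co}_{S_3} (\emptyset, I, S_2, S_3 \| \varepsilon)_{4,4}
  \xrightarrow{U}\xrightarrow{Ca}_{P} (\emptyset, I, S_2, S_3, S \| P)_{5,4} \\
 &\xrightarrow{Co}_{S_4} (\emptyset, I, S_2, S_3, S_4 \| \varepsilon)_{5,5}
  \xrightarrow{U}\xrightarrow{Ca}_{P} (\emptyset, I, S_2, S_3, S_4, S \| P)_{6,5}
  \xrightarrow{Co}_{S_4} (\emptyset, I, S_2, S_3, S_4, S_4 \| \varepsilon)_{6,6}
\end{aligned}
$$

The last state returns true since $x_4 = x_5 = S_4$. Observe that the elements
of $\boldsymbol{x}$, with the exception of the last element $x_{n-1}$, are
those of the initial chain of $(F \cup I)$, namely, $x_j$ is the set of states
reachable in at most $j - 1$ steps. In the second computation, the elements of
$\boldsymbol{x}$ are roughly those of the final chain of $(G \cap P)$. More
precisely, after (Unfold) or (Candidate), $x_{n-j}$ for $j < n - 1$ is the set
of states which only reach safe states within $j$ steps.

$$
\begin{aligned}
(\emptyset, S \| \varepsilon)_{2,2}
 &\xrightarrow{Ca}_{P} (\emptyset, S \| P)_{2,1}
  \xrightarrow{Co}_{P} (\emptyset, P \| \varepsilon)_{2,2} \\
 &\xrightarrow{U}\xrightarrow{Ca}_{P} (\emptyset, P, S \| P)_{3,2}
  \xrightarrow{D}_{S_4} (\emptyset, P, S \| S_4, P)_{3,1}
  \xrightarrow{Co}_{S_4} (\emptyset, S_4, S \| P)_{3,2}
  \xrightarrow{Co}_{P} (\emptyset, S_4, P \| \varepsilon)_{3,3} \\
 &\xrightarrow{U}\xrightarrow{Ca}_{P} (\emptyset, S_4, P, S \| P)_{4,3}
  \xrightarrow{D}_{S_4} (\emptyset, S_4, P, S \| S_4, P)_{4,2}
  \xrightarrow{Co}_{S_4} (\emptyset, S_4, S_4, S \| P)_{4,3}
\end{aligned}
$$

Observe that, by invariant (A1), the values of $\boldsymbol{x}$ in the two runs
are, respectively, the least and the greatest values for all possible
computations of AdjointPDR.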
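
The two computations can be replayed mechanically. The Python sketch below specialises the rules of Fig. 3 to a powerset lattice and is an illustration under stated assumptions, not the authors' Haskell implementation: the relation `DELTA` is a guess consistent with the chains of Example 1, (Candidate) and (Decide) always pick z = p and z = g(y_k), and the two hypothetical functions `initial_choice` and `final_choice` resolve (Conflict) as in the first and second run respectively.

```python
# Assumed transition relation over s_0..s_6 (written 0..6); not taken from Fig. 1.
DELTA = {0: {1, 2}, 1: {3}, 2: {3}, 3: {4}, 4: {4}, 5: {6}, 6: {6}}
S = frozenset(DELTA)
I = frozenset({0})
P = frozenset(range(6))


def F(X):
    return frozenset(t for s in X for t in DELTA[s])


def G(X):
    return frozenset(s for s in S if DELTA[s] <= X)


def adjoint_pdr(i, f, g, p, top, conflict_choice, max_steps=1000):
    """Sketch of Fig. 3 on a powerset lattice (order <=, meet &, join |)."""
    xs = [frozenset(), top]      # positive chain x_0, ..., x_{n-1}
    ys = []                      # negative sequence y_k, ..., y_{n-1}
    for _ in range(max_steps):
        n = len(xs)
        k = n - len(ys)
        # <TERMINATION> checks, performed before each step
        if any(xs[j + 1] <= xs[j] for j in range(n - 1)):
            return True, xs      # x conclusive
        if k == 1 and not i <= ys[0]:
            return False, xs     # y conclusive
        # <ITERATION>
        if not ys:
            if xs[-1] <= p:                  # (Unfold)
                xs = xs + [top]
            else:                            # (Candidate) with z = p
                ys = [p]
        elif not f(xs[k - 1]) <= ys[0]:      # (Decide) with z = g(y_k)
            ys = [g(ys[0])] + ys
        else:                                # (Conflict)
            z = conflict_choice(xs, ys, k)
            xs = [x & z if j <= k else x for j, x in enumerate(xs)]
            ys = ys[1:]
    raise RuntimeError("step budget exhausted")


def initial_choice(xs, ys, k):
    """(Conflict) choice of the first run: z = (F ∪ I)(x_{k-1})."""
    return F(xs[k - 1]) | I


def final_choice(xs, ys, k):
    """(Conflict) choice of the second run: z = y_k."""
    return ys[0]


first_run = adjoint_pdr(I, F, G, P, S, initial_choice)
second_run = adjoint_pdr(I, F, G, P, S, final_choice)
```

Under these assumptions both calls return true, and the returned positive chains coincide with the last states of the two runs shown above. The `max_steps` bound is a safety net of this sketch only; on a finite lattice the algorithm itself terminates.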


Theorem 5.1 follows by invariants (I2), (P1), (P3) and (KT); Theorem 5.2
by (N1), (N2) and (Kl$\dashv$). Note that both results hold for any choice of
$z$.


Theorem 5 (Soundness). AdjointPDR is sound. Namely,


1. If AdjointPDR returns true then $\mu(f \sqcup i) \sqsubseteq p$.
2. If AdjointPDR returns false then $\mu(f \sqcup i) \not\sqsubseteq p$.


3.1 Progression


It is necessary to prove that in any step of the execution, if the algorithm does 
not return true or false, then it can progress to a new state, not yet visited. 
To this aim we must deal with the subtleties of the non-deterministic choice of 
the element z in (Candidate), (Decide) and (Conflict). The following proposition 
ensures that, for any of these three rules, there is always a possible choice. 


Proposition 6 (Canonical choices). The following are always possible:
1. in (Candidate) $z = p$;
2. in (Decide) $z = g(y_k)$;
3. in (Conflict) $z = y_k$;
4. in (Conflict) $z = (f \sqcup i)(x_{k-1})$.
Thus, for all non-conclusive $s \in \mathcal{S}$, if $s_0 \to^{*} s$ then
$s \to$.


Then, Proposition 7 ensures that AdjointPDR always traverses new states.

Proposition 7 (Impossibility of loops). If $s_0 \to^{*} s \to^{+} s'$, then
$s \neq s'$.


Observe that the above propositions entail that AdjointPDR terminates 
whenever the lattice L is finite, since the set of reachable states is finite in 
this case. 


Example 8. For $(I, F, G, P)$ as in Example 1, AdjointPDR behaves essentially
as IC3/PDR [5], solving reachability problems for transition systems with
finite state space $S$. Since the lattice $\mathcal{P}S$ is also finite,
AdjointPDR always terminates.


3.2 Heuristics 


The nondeterministic choices of the algorithm can be resolved by using
heuristics. Intuitively, a heuristic chooses for any state
$s \in \mathcal{S}$ an element $z \in L$ to be possibly used in (Candidate),
(Decide) or (Conflict), so it is just a function
$h \colon \mathcal{S} \to L$. When defining a heuristic, we will avoid
specifying its values on conclusive states or on those performing (Unfold), as
they are clearly irrelevant.

With a heuristic, one can instantiate AdjointPDR by making the choice of $z$
as prescribed by $h$. Syntactically, this means erasing from the code of
Fig. 3 the three lines of choose and replacing them by
$z := h((\boldsymbol{x} \| \boldsymbol{y})_{n,k})$. We call AdjointPDR$_h$ the
resulting deterministic algorithm and write $s \to_h s'$ to mean that
AdjointPDR$_h$ moves from state $s$ to $s'$. We let
$\mathcal{S}^{h} \stackrel{\text{def}}{=} \{ s \in \mathcal{S} \mid s_0 \to_h^{*} s \}$
be the set of all states reachable by AdjointPDR$_h$.


Definition 9 (legit heuristic). A heuristic $h \colon \mathcal{S} \to L$ is
called legit whenever for all $s, s' \in \mathcal{S}^{h}$, if $s \to_h s'$
then $s \to s'$.


When $h$ is legit, the only execution of the deterministic algorithm
AdjointPDR$_h$ is one of the possible executions of the non-deterministic
algorithm AdjointPDR.

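The canonical choices provide two legit heuristics: first, we call simple any
legit heuristic $h$ that chooses $z$ in (Candidate) and (Decide) as in
Proposition 6:

$$
(\boldsymbol{x} \| \boldsymbol{y})_{n,k} \mapsto
\begin{cases}
p & \text{if } (\boldsymbol{x} \| \boldsymbol{y})_{n,k} \xrightarrow{Ca} \\
g(y_k) & \text{if } (\boldsymbol{x} \| \boldsymbol{y})_{n,k} \xrightarrow{D}
\end{cases}
\qquad (3)
$$

Then, if the choice in (Conflict) is like in Proposition 6.4, we call $h$
initial; if it is like in Proposition 6.3, we call $h$ final. In short, the two
legit heuristics are:

simple initial: (3) and $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} \mapsto (f \sqcup i)(x_{k-1})$ if $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} \xrightarrow{Co}$
simple final: (3) and $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} \mapsto y_k$ if $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} \xrightarrow{Co}$
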
Interestingly, with any simple heuristic, the sequence $\boldsymbol{y}$ takes a
familiar shape:


Proposition 10. Let $h \colon \mathcal{S} \to L$ be any simple heuristic. For
all $(\boldsymbol{x} \| \boldsymbol{y})_{n,k} \in \mathcal{S}^{h}$, invariant
(A3) holds as an equality, namely for all $j \in [k, n-1]$,
$y_j = g^{n-1-j}(p)$.


By the above proposition and (A3), the negative sequence $\boldsymbol{y}$
occurring in the execution of AdjointPDR$_h$, for a simple heuristic $h$, is
the least amongst all the negative sequences occurring in any execution of
AdjointPDR.

Instead, invariant (A1) informs us that the positive chain $\boldsymbol{x}$ is
always in between the initial chain of $f \sqcup i$ and the final chain of
$g \sqcap p$. Such values of $\boldsymbol{x}$ are obtained by, respectively,
simple initial and simple final heuristic.


Example 11. Consider the two runs of AdjointPDR in Example 4. The first one 
exploits the simple initial heuristic and indeed, the positive chain x coincides 
with the initial chain. Analogously, the second run uses the simple final heuristic. 


3.3 Negative Termination 


When the lattice $L$ is not finite, AdjointPDR may not terminate, since
checking $\mu(f \sqcup i) \sqsubseteq p$ is not always decidable. In this
section, we show that the use of certain heuristics can guarantee termination
whenever $\mu(f \sqcup i) \not\sqsubseteq p$.

The key insight is the following: if $\mu(f \sqcup i) \not\sqsubseteq p$ then
by (Kl), there should exist some $\tilde{n} \in \mathbb{N}$ such that
$(f \sqcup i)^{\tilde{n}}(\bot) \not\sqsubseteq p$. By (A1), the rule (Unfold)
can be applied only when
$(f \sqcup i)^{n-1}(\bot) \sqsubseteq x_{n-1} \sqsubseteq p$. Since (Unfold)
increases $n$ and $n$ is never decreased by other rules, then (Unfold) can be
applied at most $\tilde{n}$ times.

The elements of negative sequences are introduced by rules (Candidate) and
(Decide). If we guarantee that for any index $(n, k)$ the heuristic in such
cases returns a finite number of values for $z$, then one can prove
termination. To make this formal, we fix
$CaD^{h}_{n,k} \stackrel{\text{def}}{=} \{ (\boldsymbol{x} \| \boldsymbol{y})_{n,k} \in \mathcal{S}^{h} \mid (\boldsymbol{x} \| \boldsymbol{y})_{n,k} \xrightarrow{Ca} \text{ or } (\boldsymbol{x} \| \boldsymbol{y})_{n,k} \xrightarrow{D} \}$,
i.e., the set of all $(n, k)$-indexed states reachable by AdjointPDR$_h$ that
trigger (Candidate) or (Decide), and
$h(CaD^{h}_{n,k}) \stackrel{\text{def}}{=} \{ h(s) \mid s \in CaD^{h}_{n,k} \}$,
i.e., the set of all possible values returned by $h$ in such states.


Theorem 12 (Negative termination). Let $h$ be a legit heuristic. If
$h(CaD^{h}_{n,k})$ is finite for all $n, k$ and
$\mu(f \sqcup i) \not\sqsubseteq p$, then AdjointPDR$_h$ terminates.


Corollary 13. Let $h$ be a simple heuristic. If
$\mu(f \sqcup i) \not\sqsubseteq p$, then AdjointPDR$_h$ terminates.


Note that this corollary ensures negative termination whenever we use the
canonical choices in (Candidate) and (Decide) irrespective of the choice for
(Conflict), therefore it holds for both simple initial and simple final
heuristics.




4 Recovering Adjoints with Lower Sets 


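In the previous section, we have introduced an algorithm for checking
$\mu b \sqsubseteq p$ whenever $b$ is of the form $f \sqcup i$ for an element
$i \in L$ and a left-adjoint $f \colon L \to L$. This, unfortunately, is not
the case for several interesting problems, like the max reachability
problem [1] that we will illustrate in Sect. 5.

The next result informs us that, under standard assumptions, one can transfer
the problem of checking $\mu b \sqsubseteq p$ to lower sets, where adjoints can
always be defined. Recall that, for a lattice $(L, \sqsubseteq)$, a lower set
is a subset $X \subseteq L$ such that if $x \in X$ and $x' \sqsubseteq x$ then
$x' \in X$; the set of lower sets of $L$ forms a complete lattice
$(L^{\downarrow}, \subseteq)$ with joins and meets given by union and
intersection; as expected $\bot$ is $\emptyset$ and $\top$ is $L$. Given
$b \colon L \to L$, one can define two functions
$b^{\downarrow}, b^{\downarrow}_{r} \colon L^{\downarrow} \to L^{\downarrow}$
as $b^{\downarrow}(X) \stackrel{\text{def}}{=} b(X)^{\downarrow}$ and
$b^{\downarrow}_{r}(X) \stackrel{\text{def}}{=} \{ x \mid b(x) \in X \}$. It
holds that $b^{\downarrow} \dashv b^{\downarrow}_{r}$.

[Diagram (4): the adjoint pair $b^{\downarrow} \dashv b^{\downarrow}_{r}$ on $L^{\downarrow}$ and the map $b$ on $L$, connected by $\bigsqcup \colon L^{\downarrow} \to L$ and $(-)^{\downarrow} \colon L \to L^{\downarrow}$.]   (4)

In the diagram above,
$(-)^{\downarrow} \colon x \mapsto \{ x' \mid x' \sqsubseteq x \}$ and
$\bigsqcup \colon L^{\downarrow} \to L$ maps a lower set $X$ into
$\bigsqcup \{ x \mid x \in X \}$. The maps $\bigsqcup$ and $(-)^{\downarrow}$
form a Galois insertion, namely $\bigsqcup \dashv (-)^{\downarrow}$ and
$\bigsqcup (-)^{\downarrow} = id$, and thus one can think of (4) in terms of
abstract interpretation [8,9]: $L^{\downarrow}$ represents the concrete domain,
$L$ the abstract domain and $b$ is a sound abstraction of $b^{\downarrow}$.
Most importantly, it turns out that $b$ is forward-complete [4,14] w.r.t.
$b^{\downarrow}$, namely the following equation holds.

$$(-)^{\downarrow} \circ b = b^{\downarrow} \circ (-)^{\downarrow} \qquad (5)$$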
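
Both the adjunction on lower sets and equation (5) can be checked exhaustively on a small finite lattice. In the Python sketch below the lattice (pairs ordered pointwise), the map `b` and all helper names are assumptions chosen for illustration; `b` is monotone but does not preserve binary joins, so it is not itself a left adjoint.

```python
from itertools import combinations, product

# Hypothetical lattice: pairs (a, c) with a in {0,1,2}, c in {0,1}, pointwise order.
L = list(product(range(3), range(2)))


def leq(x, y):
    return x[0] <= y[0] and x[1] <= y[1]


def join(x, y):
    return (max(x[0], y[0]), max(x[1], y[1]))


def b(x):
    """A monotone map on L that does not preserve binary joins."""
    a, c = x
    return (min(2, a + c), c)


def down(xs):
    """Lower closure of a collection of elements of L."""
    xs = list(xs)
    return frozenset(y for y in L if any(leq(y, x) for x in xs))


# All lower sets of L, found by filtering the 2^6 subsets.
LOWER_SETS = [X for r in range(len(L) + 1)
              for X in map(frozenset, combinations(L, r)) if down(X) == X]


def b_down(X):
    """b lifted to lower sets: lower closure of the image b(X)."""
    return down(b(x) for x in X)


def b_down_r(X):
    """Its right adjoint: all x whose image under b lies in X."""
    return frozenset(x for x in L if b(x) in X)


def adjunction_holds():
    return all((b_down(X) <= Y) == (X <= b_down_r(Y))
               for X in LOWER_SETS for Y in LOWER_SETS)


def forward_complete():
    """Equation (5): the principal of b(x) equals b_down of the principal of x."""
    return all(down([b(x)]) == b_down(down([x])) for x in L)
```

The exhaustive enumeration of lower sets is exponential and serves only as a sanity check; it says nothing about how lower sets would be represented for an infinite lattice such as the one used for MDPs.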


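Proposition 14. Let $(L, \sqsubseteq)$ be a complete lattice, $p \in L$ and
$b \colon L \to L$ be an $\omega$-continuous map. Then $\mu b \sqsubseteq p$
iff $\mu(b^{\downarrow} \cup \bot^{\downarrow}) \subseteq p^{\downarrow}$.


By means of Proposition 14, we can thus solve $\mu b \sqsubseteq p$ in $L$ by
running AdjointPDR on
$(\bot^{\downarrow}, b^{\downarrow}, b^{\downarrow}_{r}, p^{\downarrow})$.
Hereafter, we tacitly assume that $b$ is $\omega$-continuous.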


4.1 AdjointPDR$^{\downarrow}$: Positive Chain in $L$, Negative Sequence in $L^{\downarrow}$


While AdjointPDR on
$(\bot^{\downarrow}, b^{\downarrow}, b^{\downarrow}_{r}, p^{\downarrow})$ might
be computationally expensive, it is the first step toward the definition of an
efficient algorithm that exploits a convenient form of the positive chain.

A lower set $X \in L^{\downarrow}$ is said to be a principal if
$X = x^{\downarrow}$ for some $x \in L$. Observe that the top of the lattice
$(L^{\downarrow}, \subseteq)$ is a principal, namely $\top^{\downarrow}$, and
that the meet (intersection) of two principals $x^{\downarrow}$ and
$y^{\downarrow}$ is the principal $(x \sqcap y)^{\downarrow}$.

Suppose now that, in (Conflict),
AdjointPDR$(\bot^{\downarrow}, b^{\downarrow}, b^{\downarrow}_{r}, p^{\downarrow})$
always chooses principals rather than arbitrary lower sets. This suffices to
guarantee that all the elements of $\boldsymbol{x}$ are principals (with the
only exception of $x_0$ which is constantly the bottom element of
$L^{\downarrow}$ that, note, is $\emptyset$ and not $\bot^{\downarrow}$). In
fact, the elements of
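
AdjointPDR$^{\downarrow}$ $(b, p)$

<INITIALISATION>
  $(\boldsymbol{x} \| \boldsymbol{Y})_{n,k} := (\emptyset, \bot, \top \| \varepsilon)_{3,3}$
<ITERATION>                                  % x, Y not conclusive
  case $(\boldsymbol{x} \| \boldsymbol{Y})_{n,k}$ of
    $\boldsymbol{Y} = \varepsilon$ and $x_{n-1} \sqsubseteq p$ :                 % (Unfold)
      $(\boldsymbol{x} \| \boldsymbol{Y})_{n,k} := (\boldsymbol{x}, \top \| \varepsilon)_{n+1,n+1}$
    $\boldsymbol{Y} = \varepsilon$ and $x_{n-1} \not\sqsubseteq p$ :             % (Candidate)
      choose $Z \in L^{\downarrow}$ such that $x_{n-1} \notin Z$ and $p \in Z$;
      $(\boldsymbol{x} \| \boldsymbol{Y})_{n,k} := (\boldsymbol{x} \| Z)_{n,n-1}$
    $\boldsymbol{Y} \neq \varepsilon$ and $b(x_{k-1}) \notin Y_k$ :              % (Decide)
      choose $Z \in L^{\downarrow}$ such that $x_{k-1} \notin Z$ and $b^{\downarrow}_{r}(Y_k) \subseteq Z$;
      $(\boldsymbol{x} \| \boldsymbol{Y})_{n,k} := (\boldsymbol{x} \| Z, \boldsymbol{Y})_{n,k-1}$
    $\boldsymbol{Y} \neq \varepsilon$ and $b(x_{k-1}) \in Y_k$ :                 % (Conflict)
      choose $z \in L$ such that $z \in Y_k$ and $b(x_{k-1} \sqcap z) \sqsubseteq z$;
      $(\boldsymbol{x} \| \boldsymbol{Y})_{n,k} := (\boldsymbol{x} \sqcap_k z \| \mathrm{tail}(\boldsymbol{Y}))_{n,k+1}$
  endcase
<TERMINATION>
  if $\exists j \in [0, n-2].\ x_{j+1} \sqsubseteq x_j$ then return true     % x conclusive
  if $Y_1 = \emptyset$ then return false                                     % Y conclusive


Fig. 4. The algorithm AdjointPDR$^{\downarrow}$ for checking
$\mu b \sqsubseteq p$: the elements of negative sequence are in
$L^{\downarrow}$, while those of the positive chain are in $L$, with the only
exception of $x_0$ which is constantly the bottom lower set $\emptyset$. For
$x_0$, we fix $b(x_0) = \bot$.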


$\boldsymbol{x}$ are all obtained by (Unfold), that adds the principal
$\top^{\downarrow}$, and by (Conflict), that takes their meets with the chosen
principal.

Since principals are in bijective correspondence with the elements of $L$, by
imposing to
AdjointPDR$(\bot^{\downarrow}, b^{\downarrow}, b^{\downarrow}_{r}, p^{\downarrow})$
to choose a principal in (Conflict), we obtain an algorithm, named
AdjointPDR$^{\downarrow}$, where the elements of the positive chain are drawn
from $L$, while the negative sequence is taken in $L^{\downarrow}$. The
algorithm is reported in Fig. 4 where we use the notation
$(\boldsymbol{x} \| \boldsymbol{Y})_{n,k}$ to emphasize that the elements of
the negative sequence are lower sets of elements in $L$.

All definitions and results illustrated in Sect. 3 for AdjointPDR are
inherited$^{1}$ by AdjointPDR$^{\downarrow}$, with the only exception of
Proposition 6.3. The latter does not hold, as it prescribes a choice for
(Conflict) that may not be a principal. In contrast, the choice in
Proposition 6.4 is, thanks to (5), a principal. This means in particular that
the simple initial heuristic is always applicable.
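
To make the shape of AdjointPDR with lower sets concrete, the following Python sketch runs the rules of Fig. 4 on a small finite lattice with the simple initial heuristic: (Candidate) picks the principal of p, (Decide) picks the right adjoint applied to Y_k, and (Conflict) picks z = b(x_{k-1}). The lattice, the map `b`, the `EMPTY` marker for x_0 and the step budget are all assumptions of this sketch; lower sets are stored explicitly as frozensets, which is only viable because the lattice is tiny.

```python
from itertools import product

# Hypothetical lattice: pairs (a, c) with a in {0..3}, c in {0,1}, pointwise order.
L = list(product(range(4), range(2)))
BOTTOM, TOP = (0, 0), (3, 1)


def leq(x, y):
    return x[0] <= y[0] and x[1] <= y[1]


def meet(x, y):
    return (min(x[0], y[0]), min(x[1], y[1]))


def b(x):
    """Monotone map; it preserves neither bottom nor binary joins."""
    a, c = x
    return (min(2, a + c), 1)


def adjoint_pdr_down(p, max_steps=1000):
    EMPTY = None                   # stands for x_0, the empty lower set
    xs = [EMPTY, BOTTOM, TOP]      # positive chain; after x_0, elements of L
    Ys = []                        # negative sequence of lower sets

    def b0(x):                     # Fig. 4 fixes b(x_0) = bottom
        return BOTTOM if x is EMPTY else b(x)

    for _ in range(max_steps):
        n, k = len(xs), len(xs) - len(Ys)
        # x conclusive; j = 0 is skipped since x_1 is never below the empty set
        if any(leq(xs[j + 1], xs[j]) for j in range(1, n - 1)):
            return True
        if k == 1 and not Ys[0]:   # Y conclusive: Y_1 is empty
            return False
        if not Ys:
            if leq(xs[-1], p):     # (Unfold)
                xs = xs + [TOP]
            else:                  # (Candidate): Z is the principal of p
                Ys = [frozenset(x for x in L if leq(x, p))]
        elif b0(xs[k - 1]) not in Ys[0]:      # (Decide): Z = {x | b(x) in Y_k}
            Ys = [frozenset(x for x in L if b(x) in Ys[0])] + Ys
        else:                      # (Conflict): z = b(x_{k-1})
            z = b0(xs[k - 1])
            xs = [x if x is EMPTY or j > k else meet(x, z)
                  for j, x in enumerate(xs)]
            Ys = Ys[1:]
    raise RuntimeError("step budget exhausted")
```

For this `b` the least fixed point is (2, 1), reached from the bottom via (0, 1) and (1, 1), so the call with p = (2, 1) answers true and the call with p = (1, 1) answers false after three (Decide) steps end in the empty lower set.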


Theorem 15. All results in Sect. 3, but Proposition 6.3, hold for
AdjointPDR$^{\downarrow}$.


4.2 AdjointPDR$^{\downarrow}$ Simulates LT-PDR


The closest approach to AdjointPDR and AdjointPDR$^{\downarrow}$ is the
lattice-theoretic extension of the original PDR, called LT-PDR [19]. While
these algorithms exploit essentially the same positive chain to find an
invariant, the main difference lies in the sequence used to witness the
existence of some counterexamples.


$^{1}$ Up to a suitable renaming: the domain is $(L^{\downarrow}, \subseteq)$
instead of $(L, \sqsubseteq)$, the parameters are
$\bot^{\downarrow}, b^{\downarrow}, b^{\downarrow}_{r}, p^{\downarrow}$ instead
of $i, f, g, p$ and the negative sequence is $\boldsymbol{Y}$ instead of
$\boldsymbol{y}$.

Definition 16 (Kleene sequence, from [19]). A sequence
$\boldsymbol{c} = c_k, \dots, c_{n-1}$ of elements of $L$ is a Kleene sequence
if the conditions (C1) and (C2) below hold. It is conclusive if also condition
(C0) holds.


(C0) $c_1 \sqsubseteq b(\bot)$, (C1) $c_{n-1} \not\sqsubseteq p$, (C2) $\forall j \in [k, n-2].\ c_{j+1} \sqsubseteq b(c_j)$.


LT-PDR tries to construct an under-approximation $c_{n-1}$ of
$b^{n-2}(\bot)$ that violates the property $p$. The Kleene sequence is
constructed by trial and error, starting by some arbitrary choice of
$c_{n-1}$.

AdjointPDR crucially differs from LT-PDR in the search for counterexamples:
LT-PDR under-approximates the final chain while AdjointPDR over-approximates
it. The algorithms are thus incomparable. However, we can draw a formal
correspondence between AdjointPDR$^{\downarrow}$ and LT-PDR by showing that
AdjointPDR$^{\downarrow}$ simulates LT-PDR, but cannot be simulated by LT-PDR.
In fact, AdjointPDR$^{\downarrow}$ exploits the existence of the adjoint to
start from an over-approximation $Y_{n-1}$ of $p^{\downarrow}$ and computes
backward an over-approximation of the set of safe states. Thus, the key
difference comes from the strategy to look for a counterexample: to prove
$\mu b \not\sqsubseteq p$, AdjointPDR$^{\downarrow}$ tries to find $Y_{n-1}$
satisfying $p \in Y_{n-1}$ and $\mu b \notin Y_{n-1}$ while LT-PDR tries to
find $c_{n-1}$ s.t. $c_{n-1} \not\sqsubseteq p$ and
$c_{n-1} \sqsubseteq \mu b$.


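Theorem 17 below states that any execution of LT-PDR can be mimicked by
AdjointPDR$^{\downarrow}$. The proof exploits a map from LT-PDR's Kleene
sequences $\boldsymbol{c}$ to AdjointPDR$^{\downarrow}$'s negative sequences
$neg(\boldsymbol{c})$ of a particular form. Let $(L^{\uparrow}, \supseteq)$ be
the complete lattice of upper sets, namely subsets $X \subseteq L$ such that
$X = X^{\uparrow} \stackrel{\text{def}}{=} \{ x' \in L \mid \exists x \in X.\ x \sqsubseteq x' \}$.
There is an isomorphism
$\neg \colon (L^{\uparrow}, \supseteq) \cong (L^{\downarrow}, \subseteq)$
mapping each $X \subseteq L$ into its complement. For a Kleene sequence
$\boldsymbol{c} = c_k, \dots, c_{n-1}$ of LT-PDR, the sequence
$neg(\boldsymbol{c}) \stackrel{\text{def}}{=} \neg(\{c_k\}^{\uparrow}), \dots, \neg(\{c_{n-1}\}^{\uparrow})$
is a negative sequence, in the sense of Definition 3, for
AdjointPDR$^{\downarrow}$. Most importantly, the assignment
$\boldsymbol{c} \mapsto neg(\boldsymbol{c})$ extends to a function, from the
states of LT-PDR to those of AdjointPDR$^{\downarrow}$, that is proved to be a
strong simulation [24].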
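
The element-wise building block of neg, the complement of the upper closure of a singleton, is easy to experiment with on a finite lattice. In the Python sketch below the lattice and the helper names `neg`, `neg_family` and `is_lower_set` are assumptions made for illustration; the last helper takes the complement of the upper closure of a finite set of elements.

```python
from itertools import product

# Hypothetical lattice: pairs (a, c) with a in {0,1,2}, c in {0,1}, pointwise order.
L = list(product(range(3), range(2)))


def leq(x, y):
    return x[0] <= y[0] and x[1] <= y[1]


def neg(c):
    """Complement of the upper closure of {c}: all x that are not above c."""
    return frozenset(x for x in L if not leq(c, x))


def neg_family(cs):
    """Complement of the upper closure of a finite set of elements."""
    cs = list(cs)
    return frozenset(x for x in L if not any(leq(c, x) for c in cs))


def is_lower_set(X):
    """Check that X is closed under going down in the order."""
    return all(y in X for x in X for y in L if leq(y, x))
```

For instance, `neg((0, 0))` is empty because every element lies above the bottom, and intersecting `neg((1, 0))` with `neg((0, 1))` gives the same lower set as `neg_family([(1, 0), (0, 1)])`.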


Theorem 17. AdjointPDR! simulates LT-PDR. 


Remarkably, AdjointPDR!’s negative sequences are not limited to the images 
of LT-PDR’s Kleene sequences: they are more general than the complement 
of the upper closure of a singleton. In fact, a single negative sequence of 
AdjointPDR! can represent multiple Kleene sequences of LT-PDR at once. Intu- 
itively, this means that a single execution of AdjointPDR! can correspond to 
multiple runs of LT-PDR. We can make this formal by means of the following 
result. 


Proposition 18. Let {c™}mem be a family of Kleene sequences. Then its point- 
wise intersection (mem neg(c™) is a negative sequence. 


The above intersection is pointwise in the sense that, for all $j \in [k, n-1]$,
it holds $(\bigcap_{m \in M} \mathrm{neg}(c^m))_j \overset{\mathrm{def}}{=} \bigcap_{m \in M} (\mathrm{neg}(c^m))_j = \neg(\{c^m_j \mid m \in M\}^\uparrow)$:
intuitively, this is (up to $\mathrm{neg}(\cdot)$) a set containing all the $M$ counterexamples. Note
that, if the negative sequence of AdjointPDR↓ makes (A3) hold as an equality, as
it is possible with any simple heuristic (see Proposition 10), then its complement
contains all Kleene sequences possibly computed by LT-PDR.
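The set-theoretic identity behind the pointwise intersection can be checked on a small finite lattice. The following is a minimal sketch, assuming a toy lattice (subsets of {0, 1, 2} ordered by inclusion) and two made-up sequences `c1` and `c2`; it only illustrates that complements of upper closures are lower sets and that intersecting them matches the complement of the upper closure of the union. It does not check the remaining conditions required of a negative sequence.

```python
from itertools import combinations

# Toy lattice (an assumption for illustration): subsets of {0, 1, 2} under inclusion.
ELEMS = [frozenset(c) for r in range(4) for c in combinations(range(3), r)]

def upper_closure(xs):
    """All lattice elements that lie above some element of xs."""
    return {y for y in ELEMS if any(x <= y for x in xs)}

def complement(xs):
    """Complement with respect to the whole lattice."""
    return set(ELEMS) - set(xs)

def is_lower_set(xs):
    """True if xs is closed under going down in the order."""
    return all(y in xs for x in xs for y in ELEMS if y <= x)

def neg(c):
    """neg(c)_j is the complement of the upper closure of the singleton {c_j}."""
    return [complement(upper_closure({cj})) for cj in c]

# Two hypothetical sequences of lattice elements (not taken from the paper).
c1 = [frozenset({0}), frozenset({0, 1})]
c2 = [frozenset({2}), frozenset({1, 2})]

# Pointwise intersection of neg(c1) and neg(c2) ...
pointwise = [a & b for a, b in zip(neg(c1), neg(c2))]
# ... versus the complement of the upper closure of {c1_j, c2_j}.
direct = [complement(upper_closure({x, y})) for x, y in zip(c1, c2)]
```

Each element of `pointwise` excludes both `c1[j]` and `c2[j]`, which is the sense in which a single lower set can account for several counterexamples at once.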


Proposition 19. Let $c$ be a Kleene sequence and $Y$ be the negative sequence s.t.
$Y_j = (b^\downarrow_r)^{n-1-j}(p^\downarrow)$ for all $j \in [k, n-1]$. Then $c_j \in \neg(Y_j)$ for all $j \in [k, n-1]$.


While the previous result suggests that simple heuristics are always the best 
in theory, as they can carry all counterexamples, this is often not the case in 
practice, since they might be computationally hard and outperformed by some 
smart over-approximations. An example is given by (6) in the next section. 


5 Instantiating AdjointPDR↓ for MDPs


In this section we illustrate how to use AdjointPDR↓ to address the max reachability
problem [1] for Markov Decision Processes.

A Markov Decision Process (MDP) is a tuple $(A, S, s_\iota, \delta)$ where $A$ is a set of
labels, $S$ is a set of states, $s_\iota \in S$ is an initial state, and $\delta\colon S \times A \to \mathcal{D}S + 1$ is a
transition function. Here $\mathcal{D}S$ is the set of probability distributions over $S$, namely
functions $d\colon S \to [0,1]$ such that $\sum_{s \in S} d(s) = 1$, and $\mathcal{D}S + 1$ is the disjoint union
of $\mathcal{D}S$ and $1 = \{*\}$. The transition function $\delta$ assigns to every label $a \in A$ and
to every state $s \in S$ either a distribution of states or $* \in 1$. We assume that
both $S$ and $A$ are finite sets and that the set $Act(s) \overset{\mathrm{def}}{=} \{a \in A \mid \delta(s,a) \neq *\}$ of
actions enabled at $s$ is non-empty for all states.
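As a concrete reading of this definition, the following minimal sketch encodes a small, hypothetical MDP as a dictionary. The state names, the labels, the encoding of $*$ as a missing key, and the helper names `act` and `is_well_formed` are all assumptions made for illustration, not part of the paper.

```python
from fractions import Fraction

STATES = ["s0", "s1", "s2"]
LABELS = ["a", "b"]

# delta[(s, a)] is a distribution over states; a missing key plays the role of "*".
delta = {
    ("s0", "a"): {"s1": Fraction(1, 2), "s2": Fraction(1, 2)},
    ("s0", "b"): {"s0": Fraction(1)},
    ("s1", "a"): {"s2": Fraction(1)},
    ("s2", "a"): {"s2": Fraction(1)},
}

def act(s):
    """Labels enabled at state s, i.e. those a with delta(s, a) different from *."""
    return [a for a in LABELS if (s, a) in delta]

def is_well_formed():
    """Every distribution sums to 1 and every state has at least one enabled label."""
    sums_ok = all(sum(dist.values()) == 1 for dist in delta.values())
    enabled_ok = all(act(s) for s in STATES)
    return sums_ok and enabled_ok
```

Exact fractions are used for the probabilities so that the sum-to-one check is not affected by rounding.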

Intuitively, the max reachability problem requires to check whether the probability
of reaching some bad states $\beta \subseteq S$ is less than or equal to a given threshold
$\lambda \in [0,1]$. Formally, it can be expressed in lattice theoretic terms, by considering
the lattice $([0,1]^S, \le)$ of all functions $d\colon S \to [0,1]$, often called frames,
ordered pointwise. The max reachability problem consists in checking $\mu b \le p$ for
$p \in [0,1]^S$ and $b\colon [0,1]^S \to [0,1]^S$, defined for all $d \in [0,1]^S$ and $s \in S$, as

$$p(s) \overset{\mathrm{def}}{=} \begin{cases} \lambda & \text{if } s = s_\iota, \\ 1 & \text{if } s \neq s_\iota, \end{cases} \qquad b(d)(s) \overset{\mathrm{def}}{=} \begin{cases} 1 & \text{if } s \in \beta, \\ \max_{a \in Act(s)} \sum_{s' \in S} d(s') \cdot \delta(s,a)(s') & \text{if } s \notin \beta. \end{cases}$$

The reader is referred to [1] for all details.
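To make the operator $b$ concrete, here is a minimal sketch on a hypothetical four-state MDP (not one of the paper's examples). It computes the least fixed point by plain Kleene iteration from the bottom frame, which happens to stabilise after a few steps on this instance; in general such an iteration need not reach the fixed point in finitely many steps, which is part of what motivates PDR-style algorithms. All names are illustrative assumptions.

```python
from fractions import Fraction

STATES = ["s0", "s1", "s2", "s3"]
BAD = {"s3"}                      # the set beta of bad states
HALF = Fraction(1, 2)

# Hypothetical transition function: delta[(s, a)] is a distribution over states.
delta = {
    ("s0", "a"): {"s1": HALF, "s2": HALF},
    ("s0", "b"): {"s0": Fraction(1)},
    ("s1", "a"): {"s2": HALF, "s3": HALF},
    ("s2", "a"): {"s2": Fraction(1)},
    ("s3", "a"): {"s3": Fraction(1)},
}

def b(d):
    """One application of b: 1 on bad states, best expected value elsewhere."""
    out = {}
    for s in STATES:
        if s in BAD:
            out[s] = Fraction(1)
        else:
            out[s] = max(
                sum(d[t] * pr for t, pr in dist.items())
                for (u, _label), dist in delta.items()
                if u == s
            )
    return out

def kleene(max_iters=100):
    """Iterate b from the bottom frame; return the fixed point if one is reached."""
    d = {s: Fraction(0) for s in STATES}
    for _ in range(max_iters):
        nxt = b(d)
        if nxt == d:
            return d
        d = nxt
    return None  # no fixed point reached within the iteration budget

def below_p(d, lam, init="s0"):
    """Check d <= p, where p is lam at the initial state and 1 elsewhere."""
    return d[init] <= lam
```

On this instance the iteration stabilises at the frame assigning 1/4 to `s0`, so the check succeeds for a threshold of 1/4 and fails for 1/5.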


Since $b$ is not of the form $f \sqcup i$ for a left adjoint $f$ (see e.g. [19]), rather
than using AdjointPDR, one can exploit AdjointPDR↓. Beyond the simple initial
heuristic, which is always applicable and enjoys negative termination, we
illustrate now two additional heuristics that are experimentally tested in Sect. 6.

The two novel heuristics make the same choices in (Candidate) and (Decide).
They exploit functions $\alpha\colon S \to A$, also known as memoryless schedulers, and the
function $b_\alpha\colon [0,1]^S \to [0,1]^S$ defined for all $d \in [0,1]^S$ and $s \in S$ as follows:

$$b_\alpha(d)(s) \overset{\mathrm{def}}{=} \begin{cases} 1 & \text{if } s \in \beta, \\ \sum_{s' \in S} d(s') \cdot \delta(s, \alpha(s))(s') & \text{otherwise.} \end{cases}$$

Since for all $D \in ([0,1]^S)^\downarrow$, $b^\downarrow_r(D) = \{d \mid b(d) \in D\} = \bigcap_\alpha \{d \mid b_\alpha(d) \in D\}$ and
since AdjointPDR↓ executes (Decide) only when $b(x_{k-1}) \notin Y_k$, there should exist
some $\alpha$ such that $b_\alpha(x_{k-1}) \notin Y_k$. One can thus fix

$$(x \| Y)_{n,k} \mapsto \begin{cases} p^\downarrow & \text{if (Candidate) applies to } (x \| Y)_{n,k}, \\ \{d \mid b_\alpha(d) \in Y_k\} & \text{if (Decide) applies to } (x \| Y)_{n,k}. \end{cases} \qquad (6)$$


Intuitively, such choices are smart refinements of those in (3): for (Candidate)
they are exactly the same; for (Decide) rather than taking $b^\downarrow_r(Y_k)$, we consider a
larger lower-set determined by the labels chosen by $\alpha$. This allows us to represent
each $Y_j$ as a set of $d \in [0,1]^S$ satisfying a single linear inequality, while using
$b^\downarrow_r(Y_k)$ would yield a system of possibly exponentially many inequalities (see
Example 21 below). Moreover, from Theorem 12, it follows that such choices
ensure negative termination.


Corollary 20. Let $h$ be a legit heuristic defined for (Candidate) and (Decide)
as in (6). If $\mu b \not\le p$, then AdjointPDR↓$_h$ terminates.
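The claim that the choice in (6) keeps each element of the negative sequence describable by a single linear inequality can be illustrated as follows. This is a minimal sketch under assumptions: the MDP, the scheduler `ALPHA`, and the helper names are hypothetical, and a lower set is represented by a pair `(r, c)` standing for $\{d \mid \sum_s r_s \cdot d(s) \le c\}$. Since $b_\alpha$ is affine, substituting it into the inequality yields again one inequality.

```python
from fractions import Fraction

STATES = ["s0", "s1", "s2", "s3"]
BAD = {"s3"}
HALF = Fraction(1, 2)

# Hypothetical MDP and memoryless scheduler (not taken from the paper).
DELTA = {
    ("s0", "a"): {"s1": HALF, "s2": HALF},
    ("s0", "b"): {"s0": Fraction(1)},
    ("s1", "a"): {"s2": HALF, "s3": HALF},
    ("s2", "a"): {"s2": Fraction(1)},
    ("s3", "a"): {"s3": Fraction(1)},
}
ALPHA = {"s0": "a", "s1": "a", "s2": "a", "s3": "a"}

def b_alpha(d):
    """The operator b restricted to the labels chosen by ALPHA."""
    return {
        s: Fraction(1) if s in BAD
        else sum(d[t] * pr for t, pr in DELTA[(s, ALPHA[s])].items())
        for s in STATES
    }

def member(d, r, c):
    """Does d satisfy the inequality sum_s r[s] * d[s] <= c?"""
    return sum(r[s] * d[s] for s in STATES) <= c

def preimage(r, c):
    """Return (r2, c2) describing {d | b_alpha(d) satisfies (r, c)}."""
    r2 = {s: Fraction(0) for s in STATES}
    c2 = c
    for s in STATES:
        if s in BAD:
            c2 -= r[s]            # b_alpha(d)(s) is the constant 1 on bad states
        else:
            for t, pr in DELTA[(s, ALPHA[s])].items():
                r2[t] += r[s] * pr
    return r2, c2
```

Iterating `preimage` from the inequality $d(s_0) \le \lambda$ produces one inequality per step; once the right-hand side becomes negative while all coefficients are non-negative, the described set is empty.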


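Example 21. Consider the maximum reachability problem with threshold $\lambda = \frac{1}{4}$
and $\beta = \{s_3\}$ for the following MDP on alphabet $A = \{a, b\}$ and $s_\iota = s_0$.

[Diagram: the MDP with states $s_0, \dots, s_3$; each transition is labelled by an action in $\{a, b\}$ and a probability.]

Hereafter we write $d \in [0,1]^S$ as column vectors with four entries $v_0 \dots v_3$ and
we will use $\cdot$ for the usual matrix multiplication. With this notation, the lower
set $p^\downarrow \in ([0,1]^S)^\downarrow$ and $b\colon [0,1]^S \to [0,1]^S$ can be written as

$$p^\downarrow = \left\{ \begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} \;\middle|\; \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} \cdot \begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} \le \left[ \tfrac{1}{4} \right] \right\} \quad \text{and} \quad b\left( \begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} \right) = \begin{bmatrix} \max\left( \frac{v_1 + v_2}{2}, \frac{v_0 + 2 v_2}{3} \right) \\ \frac{v_0 + v_3}{2} \\ v_0 \\ 1 \end{bmatrix}.$$

Amongst the several memoryless schedulers, only two are relevant for us:
$\zeta \overset{\mathrm{def}}{=} (s_0 \mapsto a, s_1 \mapsto a, s_2 \mapsto b, s_3 \mapsto a)$ and $\xi \overset{\mathrm{def}}{=} (s_0 \mapsto b, s_1 \mapsto a, s_2 \mapsto b, s_3 \mapsto a)$.
By using the definition of $b_\alpha\colon [0,1]^S \to [0,1]^S$, we have that

$$b_\zeta\left( \begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} \right) = \begin{bmatrix} \frac{v_1 + v_2}{2} \\ \frac{v_0 + v_3}{2} \\ v_0 \\ 1 \end{bmatrix} \quad \text{and} \quad b_\xi\left( \begin{bmatrix} v_0 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} \right) = \begin{bmatrix} \frac{v_0 + 2 v_2}{3} \\ \frac{v_0 + v_3}{2} \\ v_0 \\ 1 \end{bmatrix}.$$

It is immediate to see that the problem has negative answer, since using $\zeta$ in
4 steps or less, $s_0$ can reach $s_3$ already with probability $\frac{1}{4} + \frac{1}{8} + \frac{1}{16} = \frac{7}{16}$.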

To illustrate the advantages of (6), we run AdjointPDR! with the simple 
initial heuristic and with the heuristic that only differs for the choice in (Decide), 
taken as in (6). For both heuristics, the first iterations are the same: several 
[Figure 5: for each index $i$, the element $F^i$ of the negative sequence, written as a set of linear inequalities over $v_0, \dots, v_3$; central column: simple initial heuristic; rightmost column: heuristic in (6).]

Fig. 5. The elements of the negative sequences computed by AdjointPDR! for the MDP 
in Example 21. In the central column, these elements are computed by means of the 
simple initial heuristics, that is F’ = (b!)’(p'). In the rightmost column, these elements 
are computed using the heuristic in (6). In particular F’ = {d | be(d) € F’~"} fori < 3, 
while for i > 4 these are computed as F’ = {d | be(d) € F+}. 


repetitions of (Candidate), (Conflict) and (Unfold) exploiting elements of the
positive chain that form the initial chain (except for the last element $x_{n-1}$).

[Display: the states of the algorithm traversed by these first iterations.]


In the latter state the algorithm has to perform (Decide), since $b(x_5) \notin p^\downarrow$.
Now the choice of $Z$ in (Decide) is different for the two heuristics: the former uses
$b^\downarrow_r(p^\downarrow) = \{d \mid b(d) \in p^\downarrow\}$, the latter uses $\{d \mid b_\zeta(d) \in p^\downarrow\}$. Despite the different
choices, both the heuristics proceed with 6 steps of (Decide):

[Display: the states reached by the two heuristics after these steps of (Decide).]

The elements of the negative sequence $F^i$ are illustrated in Fig. 5 for both the
heuristics. In both cases, $F^0 = \emptyset$ and thus AdjointPDR↓ returns false.
To appreciate the advantages provided by (6), it is enough to compare the
two columns for the $F^i$ in Fig. 5: in the central column, the number of inequalities
defining $F^i$ grows significantly, while in the rightmost column it is always 1.


Whenever $Y_k$ is generated by a single linear inequality, we observe that $Y_k =
\{d \in [0,1]^S \mid \sum_{s \in S} (r_s \cdot d(s)) \le r\}$ for suitable non-negative real numbers $r$ and
$r_s$ for all $s \in S$. The convex set $Y_k$ is generated by finitely many $d \in [0,1]^S$
enjoying a convenient property: $d(s)$ is different from 0 and 1 only for at most
one $s \in S$. The set of its generators, denoted by $G_k$, can thus be easily computed.
We exploit this property to resolve the choice for (Conflict). We consider its subset
$Z_k = \{d \in G_k \mid b(x_{k-1}) \le d\}$ and define $z_B, z_{01} \in [0,1]^S$ for all $s \in S$ as

$$z_B(s) \overset{\mathrm{def}}{=} \begin{cases} (\bigwedge Z_k)(s) & \text{if } r_s \neq 0,\ Z_k \neq \emptyset, \\ b(x_{k-1})(s) & \text{otherwise,} \end{cases} \qquad z_{01}(s) \overset{\mathrm{def}}{=} \begin{cases} \lceil z_B(s) \rceil & \text{if } r_s = 0,\ Z_k \neq \emptyset, \\ z_B(s) & \text{otherwise,} \end{cases} \qquad (7)$$

where, for $u \in [0,1]$, $\lceil u \rceil$ denotes 0 if $u = 0$ and 1 otherwise. We call hCoB and
hCo01 the heuristics defined as in (6) for (Candidate) and (Decide) and as $z_B$,
respectively $z_{01}$, for (Conflict). The heuristic hCo01 can be seen as a Boolean
modification of hCoB, rounding up positive values to 1 to accelerate convergence.
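The two choices for (Conflict) can be sketched as follows, assuming that the set $Z_k$ is already available as a list of frames and reading the first case of $z_B$ as the pointwise minimum (meet) of $Z_k$. The function and parameter names are hypothetical; `bx` stands for the frame $b(x_{k-1})$ and `r` for the coefficients $r_s$.

```python
from fractions import Fraction

def ceil01(u):
    """Round a value in [0, 1] up to 1 unless it is exactly 0."""
    return Fraction(0) if u == 0 else Fraction(1)

def z_b(Zk, r, bx):
    """z_B: pointwise meet of Zk where r[s] != 0 and Zk is non-empty, else b(x_{k-1})."""
    out = {}
    for s in bx:
        if r[s] != 0 and Zk:
            out[s] = min(d[s] for d in Zk)
        else:
            out[s] = bx[s]
    return out

def z_01(Zk, r, bx):
    """z_01: like z_B, but positive values are rounded up to 1 where r[s] == 0."""
    zb = z_b(Zk, r, bx)
    return {
        s: ceil01(zb[s]) if (r[s] == 0 and Zk) else zb[s]
        for s in zb
    }
```

The only difference between the two functions is the rounding on coordinates that do not occur in the inequality, which matches the description of hCo01 as a Boolean modification of hCoB.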


Proposition 22. The heuristics hCoB and hCo01 are legit. 


By Corollary 20, AdjointPDR! terminates for negative answers with both 
hCoB and hCo01. We conclude this section with a last example. 


Example 23. Consider the following MDP with alphabet $A = \{a, b\}$ and $s_\iota = s_0$

[Diagram: the MDP with states $s_0, \dots, s_3$; each transition is labelled by an action in $\{a, b\}$ and a probability.]


and the max reachability problem with threshold \ = 2 and 8 = {s3}. The 
lower set pt € ((0,1]%)! and b: [0,1] — [0,1]* can be written as 


an m vd max(vo, 21422) 
p! = dè | [2 00 JE < [2}} and u(t) = vot2-v3 
v3 v3 v3 72 


With the simple initial heuristic, AdjointPDR! does not terminate. With the 
heuristic hCo01, it returns true in 14 steps, while with hCoB in 8. The first 4 
steps, common to both hCoB and hCo01, are illustrated below. 


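[Display: the first 4 steps of the execution, ending in the two different states reached with hCoB and hCo01.]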


Observe that in the first (Conflict) zg = zo1, while in the second zo1(s1) = 1 
and zp(s1) = $, leading to the two different states prefixed by vertical lines. 




6 Implementation and Experiments 


We first developed, using Haskell and exploiting its abstraction features, a common
template that accommodates both AdjointPDR and AdjointPDR↓. It is a
program parametrized by two lattices (used for positive chains and negative
sequences, respectively) and by a heuristic.

For our experiments, we instantiated the template to AdjointPDR↓ for MDPs
(letting $L = [0,1]^S$), with three different heuristics: hCoB and hCo01 from Proposition 22;
and hCoS introduced below. Besides the template (~100 lines), we
needed ~140 lines to account for hCoB and hCo01, and additional ~100 lines
to further obtain hCoS. All this indicates a clear benefit of our abstract theory:
a general template can itself be coded succinctly; instantiation to concrete
problems is easy, too, thanks to an explicitly specified interface of heuristics.

Our implementation accepts MDPs expressed in a symbolic format inspired
by Prism models [20], in which states are variable valuations and transitions are
described by symbolic functions (they can be segmented with symbolic guards
$\{guard_i\}_i$). We use rational arithmetic (Rational in Haskell) for probabilities
to limit the impact of rounding errors.
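The motivation for rational arithmetic can be seen in a two-line experiment. The sketch below uses Python's `fractions.Fraction` as a stand-in for Haskell's `Rational` (an assumption made for illustration; the implementation described here is in Haskell): summing ten probabilities of one tenth is exact with rationals but not with binary floating point, which matters when the result is compared against a threshold.

```python
from fractions import Fraction

# Ten transitions of probability 1/10 each should add up to exactly 1.
float_total = sum([0.1] * 10)                # binary floating point
exact_total = sum([Fraction(1, 10)] * 10)    # exact rational arithmetic

# A threshold check of the form "total >= 1" behaves differently in the two cases.
float_meets_threshold = float_total >= 1.0
exact_meets_threshold = exact_total >= 1
```

With floating point the accumulated sum falls slightly short of 1, so the threshold check fails, whereas the rational sum is exactly 1.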


Heuristics. The three heuristics (hCoB, hCo01, hCoS) use the same choices in
(Candidate) and (Decide), as defined in (6), but different ones in (Conflict).
The third heuristic hCoS is a symbolic variant of hCoB; it relies on our symbolic
model format. It uses $z_S$ for $z$ in (Conflict), where $z_S(s) = z_B(s)$ if $r_s \neq 0$
or $Z_k = \emptyset$. The definition of $z_S(s)$ otherwise is notable: we use a piecewise affine
function $(t_i \cdot s + u_i)_i$ for $z_S(s)$, where the affine functions $(t_i \cdot s + u_i)_i$ are guarded
by the same guards $\{guard_i\}_i$ of the MDP's transition function. We let the SMT
solver Z3 [25] search for the values of the coefficients $t_i, u_i$, so that $z_S$ satisfies
the requirements of (Conflict) (namely $b(x_{k-1})(s) \le z_S(s) \le 1$ for each $s \in S$
with $r_s = 0$), together with the condition $b(z_S) \le z_S$ for faster convergence. If
the search is unsuccessful, we give up hCoS and fall back on the heuristic hCoB.
As a task common to the three heuristics, we need to calculate $Z_k = \{d \in G_k \mid
b(x_{k-1}) \le d\}$ in (Conflict) (see (7)). Rather than computing the whole set $G_k$
of generating points of the linear inequality that defines $Y_k$, we implemented an
ad-hoc algorithm that crucially exploits the condition $b(x_{k-1}) \le d$ for pruning.


Experiment Settings. We conducted the experiments on Ubuntu 18.04 and 
AWS t2.xlarge (4 CPUs, 16 GB memory, up to 3.0 GHz Intel Scalable Processor). 
We used several Markov chain (MC) benchmarks and a couple of MDP ones. 


Research Questions. We wish to address the following questions. 


RQ1 Does AdjointPDR! advance the state-of-the-art performance of PDR algo- 
rithms for probabilistic model checking? 

RQ2 How does AdjointPDR!’s performance compare against non-PDR algo- 
rithms for probabilistic model checking? 

RQ3 Does the theoretical framework of AdjointPDR! successfully guide the 
discovery of various heuristics with practical performance? 

RQ4 Does AdjointPDR! successfully manage nondeterminism in MDPs (that 
is absent in MCs)? 


Experiments on MCs (Table 1). We used six benchmarks: Haddad-Monmege 
is from [17]; the others are from [3,19]. We compared AdjointPDR! (with three 
Table 1. Experimental results on MC benchmarks. $|S|$ is the number of states, $P$
is the reachability probability (calculated by manual inspection), $\lambda$ is the threshold
in the problem $P \le^{?} \lambda$ (shaded if the answer is no). The other columns show the
average execution time in seconds; TO is timeout (900 s); MO is out-of-memory. For
AdjointPDR↓ and LT-PDR we used the tasty-bench Haskell package and repeated
executions until std. dev. is < 5% (at least three execs). For PrIC3 and Storm, we
made five executions. Storm's execution does not depend on $\lambda$: it seems to answer
queries of the form $P \le^{?} \lambda$ by calculating $P$. We observed a wrong answer for the
entry with (†) (Storm, sp.-num., Haddad-Monmege); see the discussion of RQ2.


Benchmark |S] P à AdjointPDR! LT-PDR PrIC3 Storm 


hCoB hCo01 =hCoS none lin. pol. hyb.  sp.-num. sp.-rat. sp.-sd. 


102 0.033 0.3 0.013 0.022 0.659 0.343 1.383 23.301 MO MO 0.010 0.010 0.010 
Grid 0.2 0.013 0.031 0.657 0.519 1.571 26.668 TO MO 
ri 


3 0.3 1.156 2.187 5.633 126.441 T T T M! 
10° <0.001 he 3 ot 9 9 9 9 0.010 0.017 0.011 
0.2 1.146 2.133 5.632 161.667 TO TO TO MO 


0.1 12.909 7.969 55.788 TO TO TO MO MO 
BRP 10° 0.035 0.01 1.977 8.111 5.645 21.078 60.738 626.052 524.373 823.082 0.012 0.018 0.011 
0.005 0.604 2.261 2.709 1.429 12.171 254.000 197.940 318.840 


0.9 1.217 68.937 0.196 TO 19.765 136.491 0.630 0.468 
0.75 1.223 68.394 0.636 TO 19.782 132.780 0.602 0.467 


10 0.5 0.010 0.018 0.011 
0.52 1.228 60.024 0.739 TO 19.852 136.533 0.608 0.474 
Zero: 0.45 <0.001 0.001 0.001 <0.001 0.035 0.043 0.043 0.043 
Conk 09 MO TO 7443 TO TO TO 0.602 0.465 
0.75 M T 15.223 T T T ).599 0.470 

1 o5 OO 9 O 15:223 9 9 Os i088 0 0.037 262.193 0.031 


0.52 MO TO TO TO TO TO 0.488 0.475 
0.45 0.108 0.119 0.169 0.016 0.035 0.040 0.040 0.040 


0.9 36.083 TO 0478 TO 269.801 TO 0.938 0.686 
! 04 35.961 TO 394.955 TO 271.88 T J920 T 

Chain 10° 0.394 2” O 394.955 o pen TO. i0920 O ooo 0.014. 0.011 
0.35 101.351 TO 454.892 435.199 238.613 TO TO TO 


0.3 62.036 463.981 120.557 209.346 124.829 746.595 TO TO 


0.9 12.122 7.318 TO TO TO TO 1.878 2.053 
Double- 10° 0.215 0.3 12.120 20.424 TO TO TO TO 1.953 2.058 0.011 0.018 0.010 
Chain 0.216 12.096 19.540 TO TO TO TO 172.170 TO 


0.15 12.344 16.172 TO 16.963 TO TO TO TO 


.9 0.004 0.009 8.528 T 1.188 31.915 T M 

maia A oz 2 90 Ot o S 4 o T 0.011 0.011 1.560 
M 0.75 0.004 0.011 2.357 TO 1.209 32.143 TO 712.086 

lon- 


mege 


i .9 59.721 61. T T T T T T 
doe ay. =o, BOMED Oe 2 7 z ° n O 0.013 (t) 0.043 TO 
0.75 60.413 63.050 TO TO TO TO TO TO 


heuristics) against LT-PDR [19], PrIC3 (with four heuristics none, lin., pol., 
hyb., see [3]), and Storm 1.5 [11]. Storm is a recent comprehensive toolsuite 
that implements different algorithms and solvers. Among them, our comparison 
is against sparse-numeric, sparse-rational, and sparse-sound. The sparse engine 
uses explicit state space representation by sparse matrices; this is unlike another 
representative dd engine that uses symbolic BDDs. (We did not use dd since it 
often reported errors, and was overall slower than sparse.) Sparse-numeric is a 
value-iteration (VI) algorithm; sparse-rational solves linear (in)equations using 
rational arithmetic; sparse-sound is a sound VI algorithm [26].²


2 There are two other sound algorithms in Storm: one that utilizes interval iteration [2]
and one that performs optimistic VI [16]. We have excluded them from the
results since we observed that they returned incorrect answers.
Table 2. Experimental results on MDP benchmarks. The legend is the same as Table 1,
except that $P$ is now the maximum reachability probability.


Benchmark   |S|    P      λ      AdjointPDR↓                Storm
                                 hCoB    hCo01    hCoS      sp.-num.  sp.-rat.  sp.-sd.

CDrive2     38     0.865  0.9    MO      0.172    TO        0.019     0.019     0.018
                          0.75   MO      0.058    TO
                          0.5    0.015   0.029    86.798

TireWorld   8670   0.233  0.9    MO      3.346    TO        0.070     0.164     0.069
                          0.75   MO      3.337    TO
                          0.5    MO      6.928    TO
                          0.2    4.246   24.538   TO


Experiments on MDPs (Table 2). We used two benchmarks from [17]. We
compared AdjointPDR↓ only against Storm, since RQ1 is already addressed using
MCs (besides, PrIC3 did not run for MDPs).


Discussion. The experimental results suggest the following answers to the RQs. 


RQ1. The performance advantage of AdjointPDR!, over both LT-PDR and 
PrIC3, was clearly observed throughout the benchmarks. AdjointPDR! out- 
performed LT-PDR, thus confirming empirically the theoretical observation in 
Sect. 4.2. The profit is particularly evident in those instances whose answer is 
positive. AdjointPDR! generally outperformed PrIC3, too. Exceptions are in 
ZeroConf, Chain and DoubleChain, where PrIC3 with polynomial (pol.) and 
hybrid (hyb.) heuristics performs well. This seems to be thanks to the expres- 
sivity of the polynomial template in PrIC3, which is a possible enhancement we 
are yet to implement (currently our symbolic heuristic hCoS uses only the affine 
template). 


RQ2. The comparison with Storm is interesting. Note first that Storm's sparse-numeric
algorithm is a VI algorithm that gives a guaranteed lower bound without
guaranteed convergence. Therefore its positive answer to $P \le^{?} \lambda$ may not be
correct. Indeed, for Haddad-Monmege with $|S| \sim 10^3$, it answered $P = 0.5$
which is wrong ((†) in Table 1). This is in contrast with PDR algorithms, which
discover an explicit witness for $P \le \lambda$ via their positive chain.

Storm's sparse-rational algorithm is precise. It was faster than PDR algorithms
in many benchmarks, although AdjointPDR↓ was better or comparable
in ZeroConf ($10^4$) and Haddad-Monmege (41), for $\lambda$ such that $P \le \lambda$ is true.
We believe this suggests a general advantage of PDR algorithms, namely to
accelerate the search for an invariant-like witness for safety.

Storm's sparse-sound algorithm is a sound VI algorithm that returns correct
answers aside from numerical errors. Its performance was similar to that of
sparse-numeric, except for the two instances of Haddad-Monmege: sparse-sound
returned correct answers but was much slower than sparse-numeric. For these
two instances, AdjointPDR↓ outperformed sparse-sound.

It seems that a big part of Storm's good performance can be attributed to the
sparsity of state representation. This is notable in the comparison of the two
instances of Haddad-Monmege (41 vs. $10^3$): while Storm handles both of them
easily, AdjointPDR↓ struggles a bit in the bigger instance. Our implementation
can be extended to use sparse representation, too; this is future work.


RQ3. We derived the three heuristics (hCoB, hCo01, hCoS) exploiting the theory 
of AdjointPDR!. The experiments show that each heuristic has its own strength. 
For example, hCo01 is slower than hCoB for MCs, but it is much better for MDPs. 
In general, there is no silver bullet heuristic, so coming up with a variety of them 
is important. The experiments suggest that our theory of AdjointPDR! provides 
great help in doing so. 


RQ4. Table 2 shows that AdjointPDR! can handle nondeterminism well: once a 
suitable heuristic is chosen, its performances on MDPs and on MCs of similar 
size are comparable. It is also interesting that better-performing heuristics vary, 
as we discussed above. 


Summary. AdjointPDR! clearly outperforms existing probabilistic PDR algo- 
rithms in many benchmarks. It also compares well with Storm—a highly sophis- 
ticated toolsuite—in a couple of benchmarks. These are notable especially given 
that AdjointPDR! currently lacks enhancing features such as richer symbolic 
templates and sparse representation (adding which is future work). Overall, we 
believe that AdjointPDR! confirms the potential of PDR algorithms in proba- 
bilistic model checking. Through the three heuristics, we also observed the value 
of an abstract general theory in devising heuristics in PDR, which is probably 
true of verification algorithms in general besides PDR. 
Abstract. Quantifier elimination (qelim) is used in many automated
reasoning tasks including program synthesis, exist-forall solving, quan- 
tified SMT, Model Checking, and solving Constrained Horn Clauses 
(CHCs). Exact qelim is computationally expensive. Hence, it is often 
approximated. For example, Z3 uses “light” pre-processing to reduce the 
number of quantified variables. CHC-solver Spacer uses model-based pro- 
jection (MBP) to under-approximate qelim relative to a given model, and 
over-approximations of qelim can be used as abstractions. 

In this paper, we present the QEL framework for fast approximations 
of qelim. QEL provides a uniform interface for both quantifier reduction 
and model-based projection. QEL builds on the egraph data structure — 
the core of the EUF decision procedure in SMT — by casting quantifier 
reduction as a problem of choosing ground (i.e., variable-free) represen- 
tatives for equivalence classes. We have used QEL to implement MBP for 
the theories of Arrays and Algebraic Data Types (ADTs). We integrated 
QEL and our new MBP in Z3 and evaluated it within several tasks that 
rely on quantifier approximations, outperforming state-of-the-art. 


1 Introduction 


Quantifier Elimination (qelim) is used in many automated reasoning tasks 
including program synthesis [18], exist-forall solving [8,9], quantified SMT [5], 
and Model Checking [17]. Complete qelim, even when possible, is computation- 
ally expensive, and solvers often approximate it. We call these approximations 
quantifier reductions, to separate them from qelim. The difference is that quan- 
tifier reduction might leave some free variables in the formula. 

For example, Z3 [19] performs quantifier reduction, called QELITE, by greed- 
ily substituting variables by definitions syntactically appearing in the formulas. 
While it is very useful, it is necessarily sensitive to the order in which variables 
are substituted and depends on definitions appearing explicitly in the formula. 
Even though it may seem that these shortcomings need to be tolerated to keep 
QELITE fast, in this paper we show that it is not actually the case; we propose 
an egraph-based algorithm, QEL, to perform fast quantifier reduction that is 
complete relative to some semantic properties of the formula. 


© The Author(s) 2023 
C. Enea and A. Lal (Eds.): CAV 2023, LNCS 13965, pp. 64-86, 2023. 
Egraph [20] is a data structure that compactly represents infinitely many
terms and their equivalence classes. It was initially proposed as a decision 
procedure for EUF [20] and used for theorem proving (e.g., SIMPLIFY [7]). 
Since then, the applications of egraphs have grown. Egraphs are now used as 
term rewrite systems in equality saturation [15,23], for theory combination 
in SMT solvers [7,21], and for term abstract domains in Abstract Interpreta- 
tion [6, 10, 12]. 

Using egraphs for rewriting or other formula manipulations (like qelim) 
requires a special operation, called extract, that converts nodes in the egraph 
back into terms. Term extraction was not considered when egraphs were first 
designed [20]. As far as we know, extraction was first studied in the application 
of egraphs for compiler optimization. Specifically, equality saturation [15,22] is 
an optimization technique over egraphs that consists in populating an egraph 
with many equivalent terms inferred by applying rules. When the egraph is sat- 
urated, i.e., applying the rules has no effect, the equivalent term that is most 
desired, e.g., smallest in size, is extracted. This is a recursive process that extracts 
each sub-term by choosing one representative among its equivalents. 

Applications of egraphs to rewriting have recently resurged, driven by the egg
library [24] and the associated workshop¹. In [24], the authors show, once again,
the power and versatility of this data structure. Motivated by applications of 
equality saturation, they provide a generic and efficient framework equipped 
with term extraction, based on an extensible class analysis. 

Egraphs seem to be the perfect data-structure to address the challenges of 
quantifier reduction: they allow reasoning about infinitely many equivalent terms 
and consider all available variable definitions and orderings at once. However, 
things are not always what they appear. The key to quantifier reduction is finding 
ground (i.e., variable-free) representatives for equivalence classes with free vari- 
ables. This goes against existing techniques for term extraction since it requires 
selecting larger, rather than smaller, terms to be representatives. Selecting repre- 
sentatives carelessly makes term extraction diverge. To our surprise, this problem 
has not been studied so far. In fact, egg [24] incorrectly claims that any represen- 
tative function can be used with its term extraction, while the implementation 
diverges. In this paper, we bridge this gap by providing necessary and sufficient 
conditions for a representative function to be admissible for term extraction as 
defined in [15,24]. Furthermore, we extend extraction from terms to formulas to 
enable extracting a formula of the egraph. 

Our main contribution is a new quantifier reduction algorithm, called QEL. 
Building on the term extraction described above, it is formulated as finding a 
representative function that maximizes the number of ground terms as represen- 
tatives. Furthermore, it greedily attempts to represent variables without ground 
representatives in terms of other variables, thus further reducing the number 
of variables in the output. We show that QEL is complete relative to ground 
definitions entailed by the formula. Specifically, QEL guarantees to eliminate a 
variable if it is equivalent to a ground term. 


1 https://pldi22.sigplan.org/series/egraphs.
Whenever an application requires eliminating all free variables, incomplete
techniques such as QELITE or QEL are insufficient. In this case, qelim is
under-approximated using a Model-based Projection (MBP) that uses a model M of a
formula to guide under-approximation using equalities and variable definitions 
that are consistent with M. In this paper, we show that MBP can be implemented 
using our new techniques for QEL together with the machinery from equality 
saturation. Just like SMT solvers use egraphs as glue to combine different theory 
solvers, we use egraphs as glue to combine projection for different theories. In 
particular, we give an algorithm for MBP in the combined theory of Arrays and 
Algebraic DataTypes (ADTs). The algorithm uses insights from QEL to produce 
less under-approximate MBPs. 

We implemented QEL and the new MBP using egraphs inside the state-of-the-art
SMT solver Z3 [19]. Our implementation (referred to as Z3EG) replaces the
existing QELITE and MBP. We evaluate our algorithms in two contexts. First, 
inside the QSAT [5] algorithm for quantified satisfiability. The performance of 
QSAT in Z3EG is improved, compared to QSAT in Z3, when ADTs are involved. 
Second, we evaluate our algorithms inside the Constrained Horn Clause (CHC) 
solver SPACER [17]. Our experiments show that SPACER in Z3EG solves many 
more benchmarks containing nested Arrays and ADTs. 


Related Work. Quantifier reduction by variable substitution is widely used in 
quantified SMT [5,11]. To our knowledge, we are the first to look at this prob- 
lem semantically and provide an algorithm that guarantees that the variable is 
eliminated if the formula entails that it has a ground definition. 

Term extraction for egraphs comes from equality saturation [15,22]. The 
egg Rust library [24] is a recent implementation of equality saturation that 
supports rewriting and term extraction. However, we did not use egg because 
we integrated QEL within Z3 and built it using Z3 data structures instead. 

Model-based projection was first introduced for the SPACER CHC solver for 
LIA and LRA [17] and extended to the theory of Arrays [16] and ADTs [5]. Until 
now, it was implemented by syntactic rewriting. Our egraph-based MBP imple- 
mentation is less sensitive to syntax and, more importantly, allows for combining 
MBPs of multiple theories for MBP of the combination. As a result, our MBP 
is more general and less model dependent. Specifically, it requires fewer model 
equalities and produces more general under-approximations than [5, 16]. 


Outline. The rest of the paper is organized as follows. Section 2 provides back- 
ground. Section 3 introduces term extraction, extends it to formulas, and char- 
acterizes representative-based term extraction for egraphs. Section 4 presents 
QEL, our algorithm for fast quantifier reduction that is relatively complete. 
Section 5 shows how to compute MBP combining equality saturation and the 
ideas from Sect. 4 for the theories of ADTs and Arrays. All algorithms have been 
implemented in Z3 and evaluated in Sect. 6. 
2 Background


We assume the reader is familiar with multi-sorted first-order logic (FOL) with
equality and the theory of equality with uninterpreted functions (EUF) (for an
introduction see, e.g. [4]). We use $\approx$ to denote the designated logical equality
symbol. For simplicity of presentation, we assume that the FOL signature $\Sigma$
contains only functions (i.e., no predicates) and constants (i.e., 0-ary functions).
To represent predicates, we assume the FOL signature has a designated sort
Bool, and two Bool constants $\top$ and $\bot$, representing true and false, respectively.
We then use Bool-valued functions to represent predicates, using $P(a) \approx \top$ and
$P(a) \approx \bot$ to mean that $P(a)$ is true or false, respectively. Informally, we continue
to write $P(a)$ and $\neg P(a)$ as syntactic sugar for $P(a) \approx \top$ and $P(a) \approx \bot$, respectively.
We use lowercase letters like $a$, $b$ for constants, and $f$, $g$ for functions,
and uppercase letters like $P$, $Q$ for Bool functions that represent predicates. We
denote by $\varphi^\exists$ the existential closure of $\varphi$.


Quantifier Elimination (qelim). Given a quantifier-free (QF) formula $\varphi$ with
free variables $v$, quantifier elimination of $\varphi^\exists$ is the problem of finding a QF
formula $\psi$ with no free variables such that $\psi \equiv \varphi^\exists$. For example, a qelim of
$\exists a \cdot (a \approx x \wedge f(a) > 3)$ is $f(x) > 3$; and, there is no qelim of $\exists x \cdot (f(x) > 3)$,
because it is impossible to restrict $f$ to have "at least one value in its range that
is greater than 3" without a quantifier.


Model Based Projection (MBP). Let $\varphi$ be a formula with free variables $v$, and
$M$ a model of $\varphi$. A model-based projection of $\varphi$ relative to $M$ is a QF formula
$\psi$ such that $\psi \Rightarrow \varphi^\exists$ and $M \models \psi$. That is, $\psi$ has no free variables, is an
under-approximation of $\varphi^\exists$, and satisfies the designated model $M$, just like $\varphi$. MBP is
used by many algorithms to under-approximate qelim, when the computation of
qelim is too expensive or, for some reason, undesirable.


Egraphs. An egraph is a well-known data structure to compactly represent a set
of terms and an equivalence relation on those terms [20]. Throughout the paper,
we assume that graphs have an ordered successor relation and use $n[i]$ to denote
the $i$th successor (child) of a node $n$. An out-degree of a node $n$, $\deg(n)$, is the
number of edges leaving $n$. Given a node $n$, $parents(n)$ denotes the set of nodes
with an outgoing edge to $n$ and $children(n)$ denotes the set of nodes with an
incoming edge from $n$.


Definition 1. Let Σ be a first-order logic signature. An egraph is a tuple G = 
(N, E, L, root), where 


(a) (N, E) is a directed acyclic graph, 

(b) L maps nodes to function symbols in Σ or logical variables, and 

(c) root : N → N maps a node to its root such that the relation ρ_root ≜ 
{(n, n′) | root(n) = root(n′)} is an equivalence relation on N that is closed 
under congruence: (n, n′) ∈ ρ_root whenever n and n′ are congruent under 
root, i.e., whenever L(n) = L(n′), deg(n) = deg(n′) > 0, and ∀ 1 ≤ i ≤ 
deg(n) · (n[i], n′[i]) ∈ ρ_root. 

Fig. 1. Example egraph of φ1. 


Given an egraph G, the class of a node n ∈ G, class(n) ≜ ρ_root(n), is the set 
of all nodes that are equivalent to n. The term of n, term(n), with L(n) = f is 
f if deg(n) = 0 and f(term(n[1]), ..., term(n[deg(n)])) otherwise. We assume 
that the terms of different nodes are different, and refer to a node n by its term. 

An example of an egraph G = (N, E, L, root) is shown in Fig. 1. A symbol f 
inside a circle depicts a node n with label L(n) = f, solid black and dashed red 
arrows depict E and root, respectively. The order of the black arrows from left 
to right defines the order of the children. In our examples, we refer to a specific 
node i by its number using N(i) or its term, e.g., N(k +1). A node n without an 
outgoing red arrow is its own root. A set of nodes connected to the same node 
with red edges forms an equivalence class. In this example, root defines the 
equivalence classes {N(3),N(4),N(5),N(6)}, {N(8),N(9)}, and a class for each 
of the remaining nodes. Examples of some terms in G are term(N(9)) = y and 
term(N(5)) = read(a, y). 


An Egraph of a Formula. We consider formulas that are conjunctions of equality 
literals (recall that we represent predicate applications by equality literals). 
Given a formula φ ≜ (t1 ≈ u1 ∧ ··· ∧ tk ≈ uk), an egraph from φ is built (following 
the standard procedure [20]) by creating nodes for each ti and ui, recursively 
creating nodes for their subexpressions, merging the classes of each pair ti 
and ui, and computing the congruence closure for root. We write egraph(φ) for an 
egraph of φ constructed via some deterministic procedure based on the recipe 
above. Figure 1 shows an egraph(φ1) of φ1. The equality z ≈ read(a, x) is captured 
by N(3) and N(4) belonging to the same class (i.e., a red arrow from N(4) to 
N(3)). Similarly, the equality x ≈ y is captured by a red arrow from N(9) to N(8). 
Note that by congruence, φ1 implies read(a, x) ≈ read(a, y), which, by transitivity, 
implies that k + 1 ≈ read(a, x). In Fig. 1, this corresponds to red arrows from 
N(5) and N(6) to N(3). The predicate application 3 > z is captured by the red 
arrow from N(1) to N(0). From now on, we omit ⊤ and ⊥ and the corresponding 
edges from figures to avoid clutter. 
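
The construction recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [20]: the class name EGraph, the tuple encoding of terms, and the naive fixpoint used for congruence closure are all our assumptions, and the literal k + 1 = read(a, y) of the first example formula is inferred from the discussion of Fig. 1 rather than quoted from it.

```python
class EGraph:
    """Toy egraph: hash-consed term nodes plus a union-find over them."""

    def __init__(self):
        self.label, self.kids, self.parent = [], [], []
        self.ids = {}  # (symbol, child ids) -> node id, so equal terms share a node

    def add(self, term):
        # A term is a string (constant or variable) or a tuple (symbol, arg1, ...).
        if isinstance(term, str):
            key = (term, ())
        else:
            key = (term[0], tuple(self.add(a) for a in term[1:]))
        if key not in self.ids:
            n = len(self.label)
            self.ids[key] = n
            self.label.append(key[0])
            self.kids.append(key[1])
            self.parent.append(n)  # every new node starts as its own root
        return self.ids[key]

    def root(self, n):
        while self.parent[n] != n:
            n = self.parent[n]
        return n

    def merge(self, s, t):
        a, b = self.root(self.add(s)), self.root(self.add(t))
        if a != b:
            self.parent[b] = a
        self._close()

    def _close(self):
        # Naive congruence closure: merge nodes with the same label and
        # pairwise-equivalent children until nothing changes.
        changed = True
        while changed:
            changed = False
            seen = {}
            for n in range(len(self.label)):
                if not self.kids[n]:
                    continue
                sig = (self.label[n], tuple(self.root(c) for c in self.kids[n]))
                m = seen.setdefault(sig, n)
                if self.root(m) != self.root(n):
                    self.parent[self.root(n)] = self.root(m)
                    changed = True

    def same_class(self, s, t):
        a, b = self.add(s), self.add(t)
        self._close()
        return self.root(a) == self.root(b)


# Literals of the first example formula (partly inferred, see above).
g = EGraph()
g.merge("z", ("read", "a", "x"))
g.merge(("+", "k", "1"), ("read", "a", "y"))
g.merge("x", "y")
g.merge((">", "3", "z"), "T")  # predicate 3 > z encoded as an equality with true
```

After the third merge, congruence places read(a, x) and read(a, y) in one class, which in turn puts k + 1 in the class of z, as described for Fig. 1. A production egraph would use worklists and parent pointers instead of rescanning all nodes.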


Explicit and Implicit Equality. Note that egraphs represent equality implicitly 
by placing nodes with equal terms in the same equivalence class. Sometimes, it 
is necessary to represent equality explicitly, for example, when using egraphs for 
equality-aware rewriting (e.g., in egg [24]). To represent equality explicitly, we 
introduce a binary Bool function eq and write eq(a, b) for an equality that has 
to be represented explicitly. We change the egraph algorithm to treat eq(a, b) as 
both a function application and a logical equality a ≈ b: when processing the term 
eq(a, b), the algorithm both adds eq(a, b) to the egraph and merges the nodes for 
a and b into one class. For example, Fig. 2 shows three different interpretations 
of a formula φ2 with equality interpreted implicitly (as in [20]), explicitly (as 
in [24]), and both implicitly and explicitly (as in this paper). 


φ2(x, y) ≜ eq(c, f(x)) ∧ eq(d, f(y)) ∧ eq(x, y) 

Fig. 2. Different egraph interpretations for φ2. 


3 Extracting Formulas from Egraphs 


Egraphs were proposed as a decision procedure for EUF [20] — a setting in 
which converting an egraph back to a formula, or extracting, is irrelevant. Term 
extraction has been studied in the context of equality saturation and term rewrit- 
ing [15,24]. However, existing literature presents extraction as a heuristic, and, 
to the best of our knowledge, has not been exhaustively explored. In this section, 
we fill these gaps in the literature and extend extraction from terms to formulas. 


Term Extraction. We begin by recalling how to extract the term of a node. 
The function ntt (node-to-term) in Fig. 3 does an extraction parametrized by a 
representative function repr : N → N (same as in [24]). A function repr assigns 
each class a unique representative node (i.e., nodes in the same class are mapped 
to the same representative) so that ρ_root = ρ_repr. The function ntt extracts a 
term of a node recursively, similarly to term, except that the representatives of 
the children of a node are used instead of the actual children. We refer to terms 
built in this way by ntt(n, repr) and omit repr when it is clear from the context. 
As an example, consider repr1 ≜ {N(3), N(8)} for Fig. 1. For readability, we 
denote representative functions by sets of nodes that are the class representatives, 
omitting N(⊤) that always represents its class, and omitting all singleton classes. 
Thus, repr1 maps all nodes in class(N(3)) to N(3), nodes in class(N(8)) to 
N(8), nodes in class(N(⊤)) to N(⊤), and all singleton classes to themselves. For 
example, ntt(N(5)) extracts read(a, x), since N(9) has N(8) as its representative. 

egraph :: to_formula(repr, S) 
1: Lits := ∅ 
2: for r = repr(r) ∈ N do 
3:   t := ntt(r, repr) 
4:   for n ∈ (class(r) \ r) do 
5:     if n ∉ S then 
6:       Lits := Lits ∪ {t ≈ ntt(n, repr)} 
7: ret ⋀ Lits 

egraph :: ntt(n, repr) 
8: f := L[n] 
9: if deg(n) = 0 then 
10:   ret f 
11: else 
12:   for i ∈ [1, deg(n)] do 
13:     Args[i] := ntt(repr(n[i]), repr) 
14:   ret f(Args) 


Fig. 3. Producing formulas from an egraph. 


Formula Extraction. Let G = egraph(φ) be an egraph of some formula φ. A 
formula ψ is a formula of G, written isFormula(G, ψ), if ψ^∃ ≡ φ^∃. 

Figure 3 shows an algorithm to_formula(repr, S) to compute a formula 
ψ that satisfies isFormula(G, ψ) for a given egraph G. In addition to repr, 
to_formula is parameterized by a set of nodes S ⊆ N to exclude². To produce 
the equalities corresponding to the classes, for each representative r and 
each n ∈ (class(r) \ {r}), the output formula has a literal ntt(r) ≈ ntt(n). For 
example, using repr1 for the egraph in Fig. 1, we obtain for class(N(8)), (x ≈ y); 
for class(N(3)), (z ≈ read(a, x) ∧ z ≈ read(a, x) ∧ z ≈ k + 1); and for class(N(0)), 
(⊤ ≈ 3 > z). The final result (slightly simplified) is: x ≈ y ∧ z ≈ read(a, x) ∧ 
z ≈ k + 1 ∧ 3 > z. 
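
To make the two procedures of Fig. 3 concrete, the following Python sketch runs them on a hand-written encoding of the egraph of Fig. 1. Node ids 0 to 9 follow the node numbers used in the text; the ids chosen for a, k and 1, the string rendering of terms (with "=" standing for the logical equality symbol and "T" for true), and the helper make_repr are assumptions made only for this illustration.

```python
# id -> (label, children); ids 7, 10 and 11 (for a, k and 1) are assumed.
NODES = {
    0: ("T", ()), 1: (">", (2, 3)), 2: ("3", ()), 3: ("z", ()),
    4: ("read", (7, 8)), 5: ("read", (7, 9)), 6: ("+", (10, 11)),
    7: ("a", ()), 8: ("x", ()), 9: ("y", ()), 10: ("k", ()), 11: ("1", ()),
}
CLASSES = [{0, 1}, {3, 4, 5, 6}, {8, 9}]  # the non-singleton classes


def make_repr(chosen):
    """Map every node to its class representative; singletons map to themselves."""
    repr_ = {n: n for n in NODES}
    for cls in CLASSES:
        (r,) = cls & chosen  # exactly one chosen node per class
        for n in cls:
            repr_[n] = r
    return repr_


def ntt(n, repr_):
    """Node-to-term: like term(n), but children are replaced by their representatives."""
    label, kids = NODES[n]
    if not kids:
        return label
    return label + "(" + ", ".join(ntt(repr_[c], repr_) for c in kids) + ")"


def to_formula(repr_, excluded=frozenset()):
    """One literal per non-representative node that is not excluded."""
    lits = []
    for r in sorted(NODES):
        if repr_[r] != r:
            continue
        t = ntt(r, repr_)
        for n in sorted(NODES):
            if n != r and repr_[n] == r and n not in excluded:
                lits.append(t + " = " + ntt(n, repr_))
    return lits


repr1 = make_repr({0, 3, 8})
```

With repr1, ntt(5, repr1) yields read(a, x) because the child y is replaced by its representative x, and to_formula(repr1) lists the literal z = read(a, x) twice, once for N(4) and once for N(5), matching the unsimplified result described above.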

Let G = egraph(φ) for some formula φ. Note that ψ computed by 
to_formula is not syntactically the same as φ. That is, to_formula is not 
an inverse of egraph. Furthermore, since to_formula commits to one representative 
per class, it is limited in what formulas it can generate. For example, since 
x ≈ y is in φ1, for any repr, φ1 cannot be the result of to_formula, because 
the output can contain only one of read(a, x) or read(a, y). 


Representative Functions. The representative function is instrumental for determining 
the terms that appear in the extracted formula. To illustrate the importance 
of representative choice, consider the formula φ4 of Fig. 4 and its egraph 
G4 = egraph(φ4). For now, ignore the blue dotted lines. For repr4a, to_formula 
obtains ψa ≜ (x ≈ g(6) ∧ f(x) ≈ 6 ∧ y ≈ 6). For repr4b, to_formula produces 
ψb ≜ (g(6) ≈ x ∧ f(g(6)) ≈ 6 ∧ y ≈ 6). In some applications (like qelim considered 
in this paper) ψb is preferred to ψa: simply removing the literals g(6) ≈ x and 
y ≈ 6 from ψb results in a formula equivalent to ∃x, y · φ4 that does not contain 
variables. Now consider a third representative choice, repr4c. For node N(1), ntt does 
not terminate: to produce a term for N(1), a term for N(3), the representative 
of its child N(2), is required. Similarly, to produce a term for N(3), a term for 
N(1), the representative of its child N(5), is necessary. Thus, none of the 
terms can be extracted with repr4c. 

For extraction, representative functions repr are provided either explicitly or 
implicitly (as in [24]), the latter by associating a cost to nodes and/or terms and 
letting the representative be a node with minimal cost. However, observe that 
not all costs guarantee that the chosen repr can be used (the computation may 
not terminate). For example, the ill-defined repr4c from above is a representative 
function that satisfies the cost function that assigns function applications cost 0 
and variables and constants cost 1. A commonly used cost function is term AST 
size, which is sufficient to ensure termination of ntt(n, repr). 


² The set S affects the result, but for this section, we restrict to the case of S ≜ ∅. 

(a) repr4a = {N(4), N(5)}. (b) repr4b = {N(4), N(1)}. (c) repr4c = {N(3), N(1)}. 

Fig. 4. Egraphs of φ4 with G_repr (Color figure online). 

We are thus interested in characterizing representative functions, motivated 
by two observations: not every cost function guarantees that ntt(n) terminates; 
and the kind of representative choices that are most suitable for qelim (repr4b) 
cannot be expressed over term AST size. 


Definition 2. Given an egraph G = (N, E, L, root), a representative function 
repr : N → N is admissible for G if 


(a) repr assigns a unique representative per class, 

(b) ρ_root = ρ_repr, and 

(c) the graph G_repr is acyclic, where G_repr = (N, E_repr) and E_repr ≜ 
{(n, repr(c)) | c ∈ children(n), n ∈ N}. 


Dotted blue edges in the graphs of Fig. 4 show the corresponding G_repr. 
Intuitively, for each node n, all reachable nodes in G_repr are the nodes whose 
ntt term is necessary to produce ntt(n). Observe that G_repr for repr4c has a cycle; 
thus, repr4c is not admissible. 


Theorem 1. Given an egraph G and a representative function repr, the function 
G.to_formula(repr, ∅) terminates with result ψ such that isFormula(G, ψ) 
iff repr is admissible for G. 
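
Admissibility is straightforward to check mechanically. The sketch below tests the three conditions of Definition 2 on the egraph of the Fig. 4 example. The encoding (N(1) = g(N(2)), N(2) = y, N(3) = f(N(5)), N(4) = 6, N(5) = x, with classes {N(1), N(5)} and {N(2), N(3), N(4)}) is our reading of the surrounding discussion, not a copy of the figure, and the function names are hypothetical.

```python
# Children of each node and the equivalence classes (inferred encoding).
KIDS = {1: (2,), 2: (), 3: (5,), 4: (), 5: ()}
CLASSES = [{1, 5}, {2, 3, 4}]


def make_repr(chosen):
    """Build repr from the set of chosen representatives, one per class."""
    repr_ = {}
    for cls in CLASSES:
        (r,) = cls & chosen
        for n in cls:
            repr_[n] = r
    return repr_


def is_admissible(repr_):
    # Conditions (a) and (b): one representative per class, taken from the class.
    for cls in CLASSES:
        reps = {repr_[n] for n in cls}
        if len(reps) != 1 or not reps.issubset(cls):
            return False
    # Condition (c): G_repr has an edge n -> repr(c) for every child c of n
    # and must be acyclic; checked here with a depth-first search.
    edges = {n: {repr_[c] for c in KIDS[n]} for n in KIDS}
    state = {}  # node -> "open" while on the DFS stack, "done" afterwards

    def acyclic_from(n):
        if state.get(n) == "done":
            return True
        if state.get(n) == "open":
            return False  # back edge: a cycle through n
        state[n] = "open"
        ok = all(acyclic_from(m) for m in edges[n])
        state[n] = "done"
        return ok

    return all(acyclic_from(n) for n in KIDS)


repr4a = make_repr({4, 5})
repr4b = make_repr({4, 1})
repr4c = make_repr({3, 1})
```

Under this encoding, repr4a and repr4b pass the check, while repr4c fails because G_repr contains the edges N(1) → N(3) and N(3) → N(1).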


To the best of our knowledge, Theorem 1 is the first complete characterization 
of all terms of a node that can be obtained by extraction based on class representatives 
(by describing all admissible repr; note that their number is finite). 
This result contradicts [24], where it is claimed to be possible to extract a term 
of a node for any cost function. The counterexample is repr4c. Importantly, this 
characterization allows us to explore representative functions outside those in 
the existing literature, which, as we show in the next section, is key for qelim. 

Input: A formula φ with free variables v. 
Output: A quantifier reduction of φ. 
QEL(φ, v) 
1: G := egraph(φ) 
2: repr := G.find_defs(v) 
3: repr := G.refine_defs(repr, v) 
4: core := G.find_core(repr) 
5: ret G.to_formula(repr, G.Nodes() \ core) 


Algorithm 1: QEL — Quantifier reduction using egraphs. 


4 Quantifier Reduction 


Quantifier reduction is a relaxation of quantifier elimination: given two formulas 
φ and ψ with free variables v and u, respectively, ψ is a quantifier reduction of 
φ if u ⊆ v and φ^∃ ≡ ψ^∃. If u is empty, then ψ is a quantifier elimination of φ^∃. 
Note that quantifier reduction is possible even when quantifier elimination is not 
(e.g., for EUF). We are interested in an efficient quantifier reduction algorithm 
(that can be used as pre-processing for qelim), even if a complete qelim is possible 
(e.g., for LIA). In this section, we present such an algorithm called QEL. 

Intuitively, QEL is based on the well-known substitution rule: (∃x · x ≈ t ∧ φ) ≡ 
φ[x ↦ t]. A naive implementation of this rule, called QELITE in Z3, looks for 
syntactic definitions of the form x ≈ t for a variable x and an x-free term t and 
substitutes x with t. While efficient, QELITE is limited because of: (a) dependence 
on syntactic equality in the formula (specifically, it misses implicit equalities due 
to transitivity and congruence); (b) sensitivity to the order in which variables are 
eliminated (eliminating one variable may affect available syntactic equalities for 
another); and (c) difficulty in dealing with circular equalities such as x ≈ f(x). 

For example, consider the formula φ4(x, y) in Fig. 4. Assume that y is eliminated 
first using y ≈ f(x), resulting in x ≈ g(f(x)) ∧ f(x) ≈ 6. Now, x cannot be 
eliminated since the only equality for x is circular. Alternatively, assume that 
QELITE somehow noticed that by transitivity, φ4 implies y ≈ 6, and obtains 
(∃y · φ4) ≡ x ≈ g(6) ∧ f(x) ≈ 6. This time, x ≈ g(6) can be used to obtain 
f(g(6)) ≈ 6, which is a qelim of φ4^∃. Thus, both the elimination order and implicit 
equalities are crucial. 

In QEL, we address the above issues by using an egraph data structure to 
concisely capture all implicit equalities and terms. Furthermore, egraphs allow 
eliminating multiple variables together, ensuring that a variable is eliminated if 
it is equivalent (explicitly or implicitly) to a ground term in the egraph. 

Pseudocode for QEL is shown in Algorithm 1. Given an input formula φ, QEL 
first builds its egraph G (line 1). Then, it finds a representative function repr 
that maps variables to equivalent ground terms, as much as possible (line 2). 
Next, it further reduces the remaining free variables by refining repr to map 
each variable x to an equivalent x-free (but not variable-free) term (line 3). 
At this point, QEL is committed to the variables to eliminate. To produce the 
output, find_core identifies the subset of the nodes of G, which we call core, 
that must be considered in the output (line 4). Finally, to_formula converts 
the core of G to the resulting formula (line 5). We show that the combination of 
these steps is even stronger than variable substitution. 


(a) repr5a = {N(1), N(4), N(5)}. (b) repr5b = {N(3), N(6), N(5)}. (c) repr5c = {N(1), N(6), N(5)}. 

Fig. 5. Egraphs of φ5, including G_repr (Color figure online). 

To illustrate QEL, we apply it on φ1 and its egraph G from Fig. 1. The function 
find_defs returns repr = {N(6), N(8)}³. Node N(6) is the only node with 
a ground term in the equivalence class class(N(3)). This corresponds to the definition 
z ≈ k + 1. Node N(8) is chosen arbitrarily since class(N(8)) has no ground 
terms. There is no refinement possible, so refine_defs returns repr. The core 
is N \ {N(3), N(5), N(9)}. Nodes N(3) and N(9) are omitted because they correspond 
to variables with definitions (under repr), and N(5) is omitted because 
it is congruent to N(4), so only one of them is needed. Finally, to_formula 
produces k + 1 ≈ read(a, x) ∧ 3 > k + 1. Variables z and y are eliminated. 

In the rest of this section we present QEL in detail and QEL’s key properties. 


Finding Ground Definitions. Ground variable definitions are found by selecting 
a representative function repr that ensures that the maximum number of terms 
in the formula are rewritten into ground equivalent ones, which, in turn, means 
finding a ground definition for all variables that have one. 

Computing a representative function repr that is admissible and ensures 
finding ground definitions when they exist is not trivial. Naive approaches for 
identifying ground terms, such as iterating arbitrarily over the classes and select- 
ing a representative based on term(n) are not enough — term(n) may not be in 
the output formula. It is also not possible to make a choice based on ntt(n), 
since, in general, it cannot be yet computed (repr is not known yet). 

Admissibility raises an additional challenge since choosing a node that 
appears to be a definition (e.g., not a leaf) may cause cycles in G_repr. For example, 
consider φ5 of Fig. 5. Assume that N(1) and N(4) are chosen as representatives 
of their equivalence classes. At this point, G_repr has two edges: (N(5), N(4)) 
and (N(2), N(1)), shown by blue dotted lines in Fig. 5a. Next, if either N(2) or 
N(5) is chosen as representative (the only choices in their class), then G_repr 
becomes cyclic (shown in blue in Fig. 5a). Furthermore, backtracking on representative 
choices needs to be avoided if we are to find a representative function 
efficiently. 


³ Recall that we only show representatives of non-singleton classes. 

egraph :: find_defs(v) 
1: for n ∈ N do repr(n) := ⋆ 
2: todo := {leaf(n) | n ∈ N ∧ ground(n)} 
3: repr := process(repr, todo) 
4: todo := {leaf(n) | n ∈ N} 
5: repr := process(repr, todo) 
6: ret repr 

egraph :: process(repr, todo) 
7: while todo ≠ ∅ do 
8:   n := todo.pop() 
9:   if repr(n) ≠ ⋆ then continue 
10:   for n′ ∈ class(n) do repr(n′) := n 
11:   for n′ ∈ class(n) do 
12:     for p ∈ parents(n′) do 
13:       if ∀c ∈ children(p) · repr(c) ≠ ⋆ then 
14:         todo.push(p) 
15: ret repr 


Algorithm 2: Find definitions maximizing groundness. 

Algorithm 2 finds a representative function repr while overcoming these 
challenges. To ensure that the computed representative function is admissible 
(without backtracking), Algorithm 2 selects representatives for each class using 
a “bottom up” approach. Namely, leaves cannot be part of cycles in G_repr because 
they have no outgoing edges. Thus, they can always be safely chosen as representatives. 
Similarly, a node whose children have already been assigned representatives 
in this way (leaves initially) will also never be part of a cycle in G_repr. 
Therefore, these nodes are also safe to be chosen as representatives. 

This intuition is implemented in find_defs by initializing repr to be undefined 
(⋆) for all nodes, and maintaining a workset, todo, containing nodes that, if 
chosen for the remaining classes (under the current selection), maintain acyclicity 
of G_repr. The initialization of todo includes leaves only. The specific choice 
of leaves ensures that ground definitions are preferred, and we return to it later. 
After initialization, the function process extracts an element from todo and sets 
it as the representative of its class if the class has not been assigned yet (lines 9 
and 10). Once a class representative has been chosen, on lines 11 to 14, the parents 
of all the nodes in the class such that all the children have been chosen (the 
condition on line 13) are added to todo. 

So far, we discussed how admissibility of repr is guaranteed. To also ensure 
that ground definitions are found whenever possible, we observe that a similar 
bottom up approach identifies terms that can be rewritten into ground ones. 
This builds on the notion of constructively ground nodes, defined next. 

A class c is ground if c contains a constructively ground, or c-ground for short, 
node n, where a node n is c-ground if either (a) term(n) is ground, or (b) n is 
not a leaf and the class class(n[i]) of every child n[i] is ground. Note that nodes 
labeled by variables are never c-ground. 

In the example in Fig. 1, class(N(7)) and class(N(8)) are not ground, because 
all their nodes represent variables; class(N(6)) is ground because N(6) is c-ground. 
Nodes N(4) and N(5) are not c-ground because the class of N(8) (a 
child of both nodes) is not ground. Interestingly, N(1) is c-ground, because 
class(N(3)) = class(N(6)) is ground, even though its term 3 > z is not ground. 

Ground classes and c-ground nodes are of interest because whenever φ ⊨ 
term(n) ≈ t for some node n and ground term t, then class(n) is ground, i.e., 
it contains a c-ground node, where c-ground nodes can be found recursively 
starting from ground leaves. Furthermore, the recursive definition ensures that 
when the aforementioned c-ground nodes are selected as representatives, the 
corresponding terms w.r.t. repr are ground. 

As a result, to maximize the ground definitions found, we are interested in 
finding an admissible representative function repr that is maximally ground, 
which means that for every node n ∈ N, if class(n) is ground, then repr(n) is 
c-ground. That means that c-ground nodes are always chosen if they exist. 


Theorem 2. Let G = egraph(φ) be an egraph and repr an admissible representative 
function that is maximally ground. For all n ∈ N, if φ ⊨ term(n) ≈ t for 
some ground term t, then repr(n) is c-ground and ntt(repr(n)) is ground. 


We note that not every choice of c-ground nodes as representatives results in 
an admissible representative function. For example, consider the formula φ4 of 
Fig. 4 and its egraph. All nodes except for N(5) and N(2) are c-ground. However, 
a repr with N(3) and N(1) as representatives is not admissible. Intuitively, this 
is because the “witness” for c-groundness of N(1) in class(N(2)) is N(4) and 
not N(3). Therefore, it is important to incorporate the selection of c-ground 
representatives into the bottom up procedure that ensures admissibility of repr. 
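
The recursive definition of c-groundness can be evaluated as a simple fixpoint. The sketch below does so for the Fig. 4 example, using an encoding of its egraph that we inferred from the text (N(1) = g(N(2)), N(2) = y, N(3) = f(N(5)), N(4) = 6, N(5) = x); the class names "A" and "B" and the function name are hypothetical.

```python
LABEL = {1: "g", 2: "y", 3: "f", 4: "6", 5: "x"}
KIDS = {1: (2,), 2: (), 3: (5,), 4: (), 5: ()}
CLASS_OF = {1: "A", 5: "A", 2: "B", 3: "B", 4: "B"}
VARS = {"x", "y"}


def c_ground():
    """Return (c-ground nodes, ground classes), computed bottom up."""
    ground_nodes, ground_classes = set(), set()
    changed = True
    while changed:
        changed = False
        for n in LABEL:
            if n in ground_nodes:
                continue
            if KIDS[n]:
                # Case (b): every child lies in a class already known to be
                # ground. This also covers non-leaf nodes with a ground term.
                ok = all(CLASS_OF[c] in ground_classes for c in KIDS[n])
            else:
                # Case (a) for leaves: a constant is ground, a variable is not.
                ok = LABEL[n] not in VARS
            if ok:
                ground_nodes.add(n)
                ground_classes.add(CLASS_OF[n])
                changed = True
    return ground_nodes, ground_classes
```

On this encoding the fixpoint marks N(4), then N(1) and N(3), as c-ground, and leaves the variable nodes N(2) and N(5) out, in line with the example above. It does not by itself say which c-ground node may be chosen as representative; that is the role of the bottom-up selection.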

To promote c-ground nodes over non c-ground in the construction of an 
admissible representative function, find_defs chooses representatives in two 
steps. First, only the ground leaves are processed (line 2). This ensures that 
c-ground representatives are chosen while guaranteeing the absence of cycles. 
Then, the remaining leaves are added to todo (line 4). This triggers representative 
selection of the remaining classes (those that are not ground). 

We illustrate find_defs with two examples. For φ4 of Fig. 4, there is only one 
leaf that is ground, N(4), which is added to todo on line 2, and todo is processed. 
N(4) is chosen as representative and, as a consequence, its parent N(1) is added 
to todo. N(1) is chosen as representative, so N(3), even though added to the queue 
later, is not chosen as representative, obtaining repr4b = {N(4), N(1)}. For φ5 of 
Fig. 5, no nodes are added to todo on line 2. N(3) and N(6) are added on line 4. 
In process, both are chosen as representatives, obtaining repr5b. 

Algorithm 2 guarantees that repr is maximally ground. Together with The- 
orem 2, this implies that all terms that can be rewritten into ground equivalent 
ones will be rewritten, which, in turn, means that for each variable that has a 
ground definition, its representative is one such definition. 
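
A direct Python transcription of Algorithm 2 may help to follow the first example. It is only a sketch: the egraph of the Fig. 4 formula is encoded by hand as we inferred it from the text (N(1) = g(N(2)), N(2) = y, N(3) = f(N(5)), N(4) = 6, N(5) = x), todo is assumed to be a FIFO queue, and None plays the role of the undefined value.

```python
from collections import deque

LABEL = {1: "g", 2: "y", 3: "f", 4: "6", 5: "x"}
KIDS = {1: (2,), 2: (), 3: (5,), 4: (), 5: ()}
CLASSES = [{1, 5}, {2, 3, 4}]
UNDEF = None  # stands for "no representative chosen yet"


def class_of(n):
    return next(c for c in CLASSES if n in c)


def parents(n):
    return [p for p in KIDS if n in KIDS[p]]


def process(repr_, todo):
    while todo:
        n = todo.popleft()
        if repr_[n] is not UNDEF:
            continue  # the class of n already has a representative
        for m in class_of(n):
            repr_[m] = n
        # A parent becomes safe to choose once all its children are assigned.
        for m in class_of(n):
            for p in parents(m):
                if all(repr_[c] is not UNDEF for c in KIDS[p]):
                    todo.append(p)
    return repr_


def find_defs(variables):
    repr_ = {n: UNDEF for n in LABEL}
    leaves = [n for n in sorted(LABEL) if not KIDS[n]]
    # Phase 1: ground leaves only, so that c-ground nodes are preferred.
    process(repr_, deque(n for n in leaves if LABEL[n] not in variables))
    # Phase 2: all leaves, to cover the classes that are not ground.
    process(repr_, deque(leaves))
    return repr_
```

Running find_defs({"x", "y"}) on this encoding selects N(4) and then N(1), i.e., the representative function called repr4b above. For classes that contain several equally safe candidates, the result depends on the order in which todo is processed.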


Finding Additional (Non-ground) Definitions. At this point, QEL has found ground 
definitions while avoiding cycles in G_repr. However, this does not mean that as 
many variables as possible are eliminated. A variable can also be eliminated if 
it can be expressed as a function of other variables. This is not achieved by 
find_defs. For example, in repr5b both variables are representatives, hence 
none is eliminated, even though, since x ≈ g(f(y)), x could be eliminated in φ5 
by rewriting it as a function of y, namely g(f(y)). Algorithm 3 shows the function 
refine_defs that refines maximally ground reprs to further find such definitions 
while keeping admissibility and ground maximality. This is done by greedily 
attempting to change class representatives if they are labeled with a variable. 
refine_defs iterates over the nodes in the class, checking if there is a different 
node that is not a variable and that does not create a cycle in G_repr (line 6). 
The resulting repr remains maximally ground because representatives of ground 
classes are not changed. 

egraph :: refine_defs(repr, v) 
1: for n ∈ N do 
2:   if n = repr(n) and L(n) ∈ v then 
3:     r := n 
4:     for n′ ∈ class(n) \ {n} do 
5:       if L(n′) ∉ v then 
6:         if not cycle(n′, repr) then 
7:           r := n′ 
8:           break 
9:     for n′ ∈ class(n) do 
10:       repr[n′] := r 
11: ret repr 

egraph :: find_core(repr, v) 
1: core := ∅ 
2: for n ∈ N s.t. n = repr(n) do 
3:   core := core ∪ {n} 
4:   for n′ ∈ (class(n) \ n) do 
5:     if L(n′) ∈ v then continue 
6:     else if ∃m ∈ core · m congruent with n′ then 
7:       continue 
8:     core := core ∪ {n′} 
9: ret core 


Algorithm 3: Refining repr and building core. 

For example, let us refine repr5b = {N(3), N(6), N(5)} obtained for φ5. 
Assume that x is processed first. For class(N(x)), changing the representative 
to N(1) does not introduce a cycle (see Fig. 5c), so N(1) is selected. Next, for 
class(N(y)), choosing N(4) causes G_repr to be cyclic since N(1) was already chosen 
(Fig. 5a), so the representative of class(N(y)) is not changed. The final refinement 
is repr5c = {N(1), N(6), N(5)}. 
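
The refinement step can be replayed in Python. The sketch below assumes an encoding of the Fig. 5 egraph reconstructed from the discussion (N(1) = g(N(2)), N(2) = f(N(3)), N(3) = x, N(4) = h(N(5)), N(5) = f(N(6)), N(6) = y, with classes {N(1), N(3)}, {N(4), N(6)} and {N(2), N(5)}). The helper has_cycle is our stand-in for the cycle test of Algorithm 3: it rebuilds G_repr for a tentative choice and searches it depth first.

```python
LABEL = {1: "g", 2: "f", 3: "x", 4: "h", 5: "f", 6: "y"}
KIDS = {1: (2,), 2: (3,), 3: (), 4: (5,), 5: (6,), 6: ()}
CLASSES = [{1, 3}, {4, 6}, {2, 5}]


def class_of(n):
    return next(c for c in CLASSES if n in c)


def has_cycle(repr_):
    """True if G_repr (edges n -> repr(c) for each child c of n) has a cycle."""
    edges = {n: [repr_[c] for c in KIDS[n]] for n in KIDS}
    state = {}

    def visit(n):
        if state.get(n) == "open":
            return True
        if state.get(n) == "done":
            return False
        state[n] = "open"
        found = any(visit(m) for m in edges[n])
        state[n] = "done"
        return found

    return any(visit(n) for n in KIDS)


def refine_defs(repr_, variables):
    repr_ = dict(repr_)
    for n in sorted(LABEL):
        if repr_[n] != n or LABEL[n] not in variables:
            continue  # only variable-labeled representatives are revisited
        for cand in sorted(class_of(n) - {n}):
            if LABEL[cand] in variables:
                continue
            trial = dict(repr_)
            for m in class_of(n):
                trial[m] = cand
            if not has_cycle(trial):
                repr_ = trial  # keep the first non-variable, cycle-free choice
                break
    return repr_


repr5b = {1: 3, 3: 3, 4: 6, 6: 6, 2: 5, 5: 5}
repr5c = refine_defs(repr5b, {"x", "y"})
```

Processing x first replaces N(3) by N(1); the attempt to replace N(6) by N(4) is rejected because it would close the cycle N(4) → N(5) → N(4). Rebuilding G_repr for every candidate is quadratic and is done here only for clarity.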

At this point, QEL found a representative function repr with as many ground 
definitions as possible and attempted to refine repr to have fewer variables as 
representatives. Next, QEL finds a core of the nodes of the egraph, based on 
repr, that will govern the translation of the egraph to a formula. While repr 
determines the semantic rewrites of terms that enable variable elimination, it is 
the use of the core in the translation that actually eliminates them. 


Variable Elimination Based on a Core. A core of an egraph G = (N, E, L, root) 
and a representative function repr is a subset of the nodes Nc ⊆ N such that 
ψc = G.to_formula(repr, N \ Nc) satisfies isFormula(G, ψc). 

Algorithm 3 shows pseudocode for find_core that computes a core of an 
egraph for a given representative function. The idea is that non-representative 
nodes that are labeled by variables, as well as nodes congruent to nodes that 
are already in the core, need not be included in the core. The former are not 
needed since we are only interested in preserving the existential closure of the 
output, while the latter are not needed since congruent nodes introduce the same 
syntactic terms in the output. For example, for φ1 and repr = {N(6), N(8)}, find_core 
returns core1 = N_1 \ {N(3), N(5), N(9)}. Nodes N(3) and N(9) are excluded because they 
are labeled with variables, and node N(5) because it is congruent with N(4). 

Finally, QEL produces a quantifier reduction by applying to_formula with 
the computed repr and core. Variables that are not in the core (they are not 
representatives) are eliminated; this includes variables that have a ground definition. 
However, QEL may eliminate a variable even if it is a representative (and 
thus it is in the core). As an example, consider ψ(x, y) ≜ f(x) ≈ f(y) ∧ x ≈ y, 
whose egraph G contains 2 classes with 2 nodes each. The core Nc relative to 
any admissible repr contains only one representative per class: in class(N(x)) 
because both nodes are labeled with variables, and in class(N(f(x))) because 
the nodes are congruent. In this case, to_formula(repr, Nc) results in ⊤ (since singleton 
classes in the core produce no literals in the output formula), a quantifier 
elimination of ψ. More generally, the variables are eliminated because none of 
them is reachable in G_repr from a non-singleton class in the core (only such 
classes contribute literals to the output). 

We conclude the presentation of QEL by showing its output for our examples. 
For φ1, QEL obtains (k + 1 ≈ read(a, x) ∧ 3 > k + 1), a quantifier reduction, 
using repr = {N(6), N(8)} and core1 = N_1 \ {N(3), N(5), N(9)}. For φ4, QEL 
obtains (6 ≈ f(g(6))), a quantifier elimination, using repr4b = {N(4), N(1)} 
and core4b = N_4 \ {N(5), N(2)}. Finally, for φ5, QEL obtains (y ≈ h(f(y)) ∧ 
f(g(f(y))) ≈ f(y)), a quantifier reduction, using repr5c = {N(1), N(6), N(5)} 
and core5c = N_5 \ {N(3)}. 
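
The last two steps of QEL, computing the core and extracting the formula, can be replayed on the third example. The sketch below assumes the same hand-written encoding of the Fig. 5 egraph as reconstructed from the text (N(1) = g(N(2)), N(2) = f(N(3)), N(3) = x, N(4) = h(N(5)), N(5) = f(N(6)), N(6) = y) and takes the refined representative function {N(1), N(6), N(5)} as given; function names and the string rendering of literals are ours.

```python
LABEL = {1: "g", 2: "f", 3: "x", 4: "h", 5: "f", 6: "y"}
KIDS = {1: (2,), 2: (3,), 3: (), 4: (5,), 5: (6,), 6: ()}
VARS = {"x", "y"}
REPR = {1: 1, 3: 1, 4: 6, 6: 6, 2: 5, 5: 5}  # the refined repr of the example


def congruent(a, b):
    """Same label, same non-zero arity, and pairwise-equivalent children."""
    return (LABEL[a] == LABEL[b]
            and len(KIDS[a]) == len(KIDS[b]) > 0
            and all(REPR[c] == REPR[d] for c, d in zip(KIDS[a], KIDS[b])))


def find_core():
    core = set()
    for r in sorted(LABEL):
        if REPR[r] != r:
            continue
        core.add(r)
        for n in sorted(LABEL):
            if n == r or REPR[n] != r:
                continue
            if LABEL[n] in VARS:
                continue  # non-representative variable: drop it
            if any(congruent(m, n) for m in core):
                continue  # would only repeat a term already in the output
            core.add(n)
    return core


def ntt(n):
    if not KIDS[n]:
        return LABEL[n]
    return LABEL[n] + "(" + ", ".join(ntt(REPR[c]) for c in KIDS[n]) + ")"


def to_formula(excluded):
    return [ntt(r) + " = " + ntt(n)
            for r in sorted(LABEL) if REPR[r] == r
            for n in sorted(LABEL)
            if n != r and REPR[n] == r and n not in excluded]


core = find_core()
result = to_formula(set(LABEL) - core)
```

Here the core is every node except N(3), and the extracted literals are f(y) = f(g(f(y))) and y = h(f(y)): x no longer occurs, while y remains, so the result is a quantifier reduction rather than an elimination.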


Guarantees of QEL. Correctness of QEL is straightforward. We conclude this 
section by providing two conditions that ensure that a variable is eliminated by 
QEL. The first condition guarantees that a variable is eliminated whenever a 
ground definition for it exists (regardless of the specific representative function 
and core computed by QEL). This makes QEL complete relative to quantifier 
elimination based on ground definitions. Relative completeness is an important 
property since it means that QEL is unaffected by variable orderings and 
syntactic rewrites, unlike QELITE. The second condition, illustrated by ψ above, 
depends on the specific representative function and core computed by QEL. 


Theorem 3. Let φ be a QF conjunction of literals with free variables v, and let 
v ∈ v. Let G = egraph(φ), n_v the node in G such that L(n_v) = v, and repr and 
core computed by QEL. We denote by NS = {n ∈ core | (class(n) ∩ core) ≠ 
{n}} the set of nodes from classes with two or more nodes in core. If one of the 
following conditions holds, then v does not appear in QEL(φ, v): 


(1) there exists a ground term t s.t. φ ⊨ v ≈ t, or 
(2) n_v is not reachable from any node in NS in G_repr. 


As a corollary, if every variable meets one of the two conditions, then QEL finds 
a quantifier elimination. 

This concludes the presentation of our quantifier reduction algorithm. Next, 
we show how QEL can be used to under-approximate quantifier elimination, 
which allows working with formulas for which QEL does not result in a qelim. 

ELIMWRRD1: 
    φ[read(write(t, i, v), j)]    M ⊨ i ≈ j 
    ---------------------------------------- 
    φ[v] ∧ i ≈ j 

ELIMWRRD2: 
    φ[read(write(t, i, v), j)]    M ⊨ i ≉ j 
    ---------------------------------------- 
    φ[read(t, j)] ∧ i ≉ j 

Fig. 6. Two MBP rules from [16]. The notation φ[t] means that φ contains 
term t. The rules rewrite all occurrences of read(write(t, i, v), j) with v 
and read(t, j), respectively. 

ElimWrRd 
1: function match(t) 
2:   ret t = read(write(s, i, v), j) 
3: function apply(t, M, G) 
4:   if M ⊨ i ≈ j then 
5:     G.assert(i ≈ j) 
6:     G.assert(t ≈ v) 
7:   else 
8:     G.assert(i ≉ j) 
9:     G.assert(t ≈ read(s, j)) 

Fig. 7. Adaptation of rules in Fig. 6 using QEL API. 


5 Model Based Projection Using QEL 


Applications like model checking and quantified satisfiability require efficient 
computation of under-approximations of quantifier elimination. They make use 
of model-based projection (MBP) algorithms to project variables that cannot be 
eliminated cheaply. Our QEL algorithm is efficient and relatively complete, but it 
does not guarantee to eliminate all variables. In this section, we use a model and 
theory-specific projection rules to implement an MBP algorithm on top of QEL. 

We focus on two important theories: Arrays and Algebraic DataTypes (ADT). 
They are widely used to encode program verification tasks. Prior works separately 
develop MBP algorithms for Arrays [16] and ADTs [5]. Both MBPs were presented 
as a set of syntactic rewrite rules applied until fixed point. 

Combining the MBP algorithms for Arrays and ADTs is non-trivial because 
applying projection rules for one theory may produce terms of the other theory. 
Therefore, separately achieving saturation in either theory is not sufficient to 
reach saturation in the combined setting. The MBP for the combined setting 
has to call both MBPs, check whether either one of them produced terms that 
can be processed by the other, and, if so, call the other algorithm. This is similar 
to theory combination in SMT solving where the core SMT solver has to keep 
track of different theory solvers and exchange terms between them. 

Our main insight is that egraphs can be used as a glue to combine MBP 
algorithms for different theories, just like egraphs are used in SMT solvers to 
combine satisfiability checking for different theories. Implementing MBP using 
egraphs allows us to use the insights from QEL to combine MBP with on-the-fly 
quantifier reduction to produce less under-approximate formulas than what we 
get by syntactic application of MBP rules. 

To implement MBP using egraphs, we implement all rewrite rules for MBP in 
Arrays [16] and ADTs [5] on top of egraphs. In the interest of space, we explain 
the implementation of just a couple of the MBP rules for Arrays’. 

Figure 6 shows two Array MBP rules from [16]: ELIMWRRD1 and
ELIMWRRD2. Here, $\varphi$ is a formula with arrays and $M$ is a model for $\varphi$. Both
rules rewrite terms which match the pattern read(write(t,i,v), j), where t, i, j, v
are all terms and t contains a variable to be projected. ELIMWRRD1 is applicable
when $M \models i \approx j$. It rewrites the term read(write(t,i,v), j) to v. ELIMWRRD2
is applicable when $M \not\models i \approx j$ and rewrites read(write(t,i,v), j) to read(t, j).

Figure 7 shows the egraph implementation of ELIMWRRD1 and ELIMWRRD2.
The match(t) method checks if t syntactically matches read(write(s,i,v), j), where
s contains a variable to be projected. The apply(t) method assumes that t is
read(write(s,i,v), j). It first checks if $M \models i \approx j$, and, if so, it adds $i \approx j$ and
$t \approx v$ to the egraph G. Otherwise, if $M \not\models i \approx j$, apply(t) adds a disequality
$i \not\approx j$ and an equality $t \approx \mathit{read}(s, j)$ to G. That is, the egraph implementation
of the rules only adds (and does not remove) literals that capture the side
condition and the conclusion of the rule.
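The match/apply interface described above can be illustrated with a small Python sketch. It is not the Z3 implementation: `ToyEGraph`, `ElimWrRd`, and `contains_var` are hypothetical names, the "egraph" only records the literals that a rule adds, and the model is assumed to be a dictionary that maps index constants to values.

```python
from dataclasses import dataclass, field

# Terms are nested tuples, e.g. ("read", ("write", "a", "i", "v"), "j").
def contains_var(term, vars_to_project):
    """True if `term` mentions any variable that should be projected."""
    if isinstance(term, str):
        return term in vars_to_project
    return any(contains_var(arg, vars_to_project) for arg in term[1:])

@dataclass
class ToyEGraph:
    """Stand-in for an egraph: it only records the literals that rules add."""
    eqs: set = field(default_factory=set)
    diseqs: set = field(default_factory=set)

    def add_eq(self, s, t):
        self.eqs.add((s, t))

    def add_diseq(self, s, t):
        self.diseqs.add((s, t))

class ElimWrRd:
    """Covers both ELIMWRRD1 and ELIMWRRD2; the model selects the case."""
    def __init__(self, vars_to_project):
        self.vars = vars_to_project

    def match(self, t):
        # t must look like read(write(s, i, v), j), with s mentioning a projected variable
        return (isinstance(t, tuple) and len(t) == 3 and t[0] == "read"
                and isinstance(t[1], tuple) and len(t[1]) == 4
                and t[1][0] == "write"
                and contains_var(t[1][1], self.vars))

    def apply(self, t, model, g):
        _, (_, s, i, v), j = t
        if model[i] == model[j]:        # model satisfies i = j (ELIMWRRD1)
            g.add_eq(i, j)              # side condition
            g.add_eq(t, v)              # conclusion
        else:                           # model falsifies i = j (ELIMWRRD2)
            g.add_diseq(i, j)           # side condition
            g.add_eq(t, ("read", s, j)) # conclusion
```

Note that `apply` never deletes anything: the original term stays in the structure, and only the side condition and the conclusion of the rule are added.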

Our algorithm for MBP based on egraphs, MBP-QEL, is shown in Alg. 4. 
It initializes an egraph with the input formula (line 1), applies MBP rules until 
saturation (line 4), and then uses the steps of QEL (lines 7-12) to generate the 
projected formula. 

Applying rules is as straightforward as iterating over all terms t in the egraph, 
and for each rule r such that r.match(t) is true, calling r.apply(t, M, G) (lines 14- 
22). As opposed to the standard approach based on formula rewriting, here the
terms are not rewritten: both the original and the newly added term remain in
the egraph. Therefore, it is possible to get into an
infinite loop by re-applying the same rules on the same terms over and over again. 
To avoid this, MBP-QEL marks terms as seen (line 23) and avoids them in the 
next iteration (line 15). Some rules in MBP are applied to pairs of terms. For 
example, ACKERMANN rewrites pairs of read terms over the same variable. This 
is different from usual applications where rewrite rules are applied to individual 
expressions. Yet, it is easy to adapt such pairwise rewrite rules to egraphs by 
iterating over pairs of terms (lines 25-30). 

MBP-QEL does not apply MBP rules to terms that contain variables but 
are already c-ground (line 16), which is sound because such terms are replaced by 
ground terms in the output (Theorem 3). This prevents unnecessary application 
of MBP rules thus allowing MBP-QEL to compute MBPs that are closer to a 
quantifier elimination (less model-specific). 

Just like each application of a rewrite rule introduces a new term to a formula, 
each call to the apply method of a rule adds new terms to the egraph. Therefore, 
each call to ApplyRules (line 4) makes the egraph bigger. However, provided 
that the original MBP combination is terminating, the iterative application of 
ApplyRules terminates as well (due to marking). 
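The saturation loop with marking can be sketched as follows. This is a simplified illustration rather than Alg. 4 itself: a plain Python set stands in for the egraph, pairwise rules and the c-ground filter are omitted, and each rule set gets its own set of marked terms (an assumption of this sketch). `Unwrap` is a hypothetical toy rule whose only purpose is to show one rule set producing terms for the other.

```python
def apply_rules(terms, rules, model, seen):
    """One pass: apply every matching rule to every term not yet marked.

    `terms` stands in for the egraph's nodes; rules only ever add to it.
    Returns True if some rule fired, i.e. progress was made.
    """
    progress = False
    snapshot = set(terms)              # the nodes present at the start of the pass
    for t in snapshot - seen:          # skip terms handled in earlier passes
        for rule in rules:
            if rule.match(t):
                rule.apply(t, model, terms)
                progress = True
    seen |= snapshot                   # mark, so no rule is re-applied to these terms
    return progress

def saturate(terms, rule_sets, model):
    """Alternate between the rule sets until none of them makes progress."""
    seen = [set() for _ in rule_sets]  # one marking set per rule set (sketch assumption)
    progress = True
    while progress:
        progress = False
        for rules, marked in zip(rule_sets, seen):
            if apply_rules(terms, rules, model, marked):
                progress = True
    return terms

class Unwrap:
    """Toy rule: from a term (tag, x), add the term x."""
    def __init__(self, tag):
        self.tag = tag

    def match(self, t):
        return isinstance(t, tuple) and t[0] == self.tag

    def apply(self, t, model, terms):
        terms.add(t[1])
```

Starting from the single term `("arr", ("adt", ("arr", "x")))` with rule sets `[[Unwrap("arr")], [Unwrap("adt")]]`, each theory's pass exposes a term for the other theory, and the loop stops once a full round adds nothing new.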

Some MBP rules introduce new variables to the formula. MBP-QEL computes
repr based on both original and newly introduced variables (line 7). This
allows MBP-QEL to eliminate all variables, including non-Array, non-ADT
variables, that are equivalent to ground terms (Theorem 3).


4 Implementation of all other rules is similar.

Input: A QF formula φ with free variables v̄ all of sort Array(I,V) or ADT, a model
       M ⊨ φ, and sets of rules ArrayRules and ADTRules
Output: A cube ψ s.t. ψ^∃ ⇒ φ^∃, M ⊨ ψ, and vars(ψ) are not Arrays or ADTs

MBP-QEL(φ, v̄, M)
 1: G := egraph(φ)
 2: p1, p2 := ⊤, ⊤; S, Sp := ∅, ∅
 3: while p1 ∨ p2 do
 4:     p1 := ApplyRules(G, M, ArrayRules, S, Sp)
 5:     p2 := ApplyRules(G, M, ADTRules, S, Sp)
 6: v̄′ := G.Vars()
 7: repr := G.find_defs(v̄′)
 8: repr := G.refine_defs(repr, v̄′)
 9: core := G.find_core(repr, v̄′)
10: v̄e := {v ∈ v̄′ | is_arr(v) ∨ is_adt(v)}
11: core_e := {n ∈ core | gr(term(n), v̄e)}
12: ret G.to_formula(repr, G.Nodes() \ core_e)

ApplyRules(G, M, R, S, Sp)
13: progress := ⊥
14: N := G.Nodes()
15: U := {n | n ∈ N \ S}
16: T := {term(n) | n ∈ U ∧ (is_eq(term(n)) ∨ ¬c-ground(n))}
17: Rp := {r ∈ R | r.is_for_pairs()}
18: Ru := R \ Rp
19: for each t ∈ T, r ∈ Ru do
20:     if r.match(t) then
21:         r.apply(t, M, G)
22:         progress := ⊤
23: S := S ∪ N
24: Np := {⟨n1, n2⟩ | n1, n2 ∈ N}
25: Tp := {term(np) | np ∈ Np \ Sp}
26: for each tp ∈ Tp, r ∈ Rp do
27:     if r.match(tp) then
28:         r.apply(tp, M, G)
29:         progress := ⊤
30: Sp := Sp ∪ Np
31: ret progress

Algorithm 4: MBP-QEL: an MBP using QEL. Here gr(t, v̄) checks whether
term t contains any variables in v̄ and is_eq(t) checks if t is an equality literal.


As mentioned earlier, MBP-QEL never removes terms while rewrite rules 
are saturating. Therefore, after saturation, the egraph still contains all original 
terms and variables. From soundness of the MBP rules, it follows that after 
each invocation of apply, MBP-QEL creates an under-approximation of $\varphi^{\exists}$
based on the model M. From completeness of MBP rules, it follows that, after 
saturation, all terms containing Array or ADT variables can be removed from 
the egraph without affecting equivalence of the saturated egraph. Hence, when 
calling to_formula, MBP-QEL removes all terms containing Array or ADT 
variables (line 12). This includes, in particular, all the terms on which rewrite 
rules were applied, but potentially more. 

We demonstrate our MBP algorithm on an example with nested ADTs and
Arrays. Let $P \triangleq \langle A_{I \times I}, I \rangle$ be the datatype of a pair of an integer array and an
integer, and let $\mathit{pair} : A_{I \times I} \times I \to P$ be its sole constructor with destructors
$\mathit{fst} : P \to A_{I \times I}$ and $\mathit{snd} : P \to I$. In the following, let $i$, $l$, $j$ be integers, $a$ an
integer array, $p$, $p'$ pairs, and $p_1$, $p_2$ arrays of pairs ($A_{I \times P}$). Consider the formula:

$$\varphi_{mbp}(p, a) \triangleq \mathit{read}(a, i) \approx i \land p \approx \mathit{pair}(a, l) \land p_2 \approx \mathit{write}(p_1, j, p) \land p \not\approx p'$$

where $p$ and $a$ are free variables that we want to project and all of $i$, $j$, $l$, $p_1$, $p_2$, $p'$
are constants that we want to keep. MBP is guided by a model $M_{mbp} \models \varphi_{mbp}$.
To eliminate $p$ and $a$, MBP-QEL constructs the egraph of $\varphi_{mbp}$ and applies the
MBP rules. In particular, it uses Array MBP rules to rewrite the $\mathit{write}(p_1, j, p)$
term by adding the equality $\mathit{read}(p_2, j) \approx p$ and merging it with the equivalence
class of $p_2 \approx \mathit{write}(p_1, j, p)$. It then applies ADT MBP rules to deconstruct the
equality $p \approx \mathit{pair}(a, l)$ by creating two equalities $\mathit{fst}(p) \approx a$ and $\mathit{snd}(p) \approx l$. Finally,
the call to to_formula produces


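$$\begin{aligned}
&\mathit{read}(\mathit{fst}(\mathit{read}(p_2, j)), i) \approx i \land \mathit{snd}(\mathit{read}(p_2, j)) \approx l \;\land \\
&\mathit{read}(p_2, j) \approx \mathit{pair}(\mathit{fst}(\mathit{read}(p_2, j)), l) \;\land \\
&p_2 \approx \mathit{write}(p_1, j, \mathit{read}(p_2, j)) \land \mathit{read}(p_2, j) \not\approx p'
\end{aligned}$$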


The output is easy to understand by tracing it back to the input. For example, 
the first literal is a rewrite of the literal $\mathit{read}(a, i) \approx i$ where $a$ is represented
with $\mathit{fst}(p)$ and $p$ is represented with $\mathit{read}(p_2, j)$. While the interaction of these
rules might seem straightforward in this example, the MBP implementation in 
Z3 fails to project a in this example because of the multilevel nesting. 
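The tracing described above can be written out as a chain of substitutions. The following is only a sketch of that reasoning, using the equalities named in the text ($a \approx \mathit{fst}(p)$, $\mathit{snd}(p) \approx l$, and $p \approx \mathit{read}(p_2, j)$). It is not a trace of the actual egraph operations.

```latex
\begin{align*}
\mathit{read}(a, i) \approx i
  &\;\Longrightarrow\; \mathit{read}(\mathit{fst}(p), i) \approx i
  && \text{since } a \approx \mathit{fst}(p) \\
  &\;\Longrightarrow\; \mathit{read}(\mathit{fst}(\mathit{read}(p_2, j)), i) \approx i
  && \text{since } p \approx \mathit{read}(p_2, j) \\
\mathit{snd}(p) \approx l
  &\;\Longrightarrow\; \mathit{snd}(\mathit{read}(p_2, j)) \approx l
  && \text{since } p \approx \mathit{read}(p_2, j)
\end{align*}
```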

Notably, in this example, the c-ground computation during projection allows
MBP-QEL to avoid splitting on the disequality $p \not\approx p'$ based on the model. While
ADT MBP rules eliminate disequalities by using the model to split them, MBP-QEL
benefits from the fact that, after the application of Array MBP rules, the
class of $p$ becomes ground, making $p \not\approx p'$ c-ground. Thus, the c-ground
computation allows MBP-QEL to produce a formula that is less approximate than
those produced by syntactic application of MBP rules. In fact, in this example,
a quantifier elimination is obtained (the model $M_{mbp}$ was not used).

In the next section, we show that our improvements to MBP translate to 
significant improvements in a CHC-solving procedure that relies on MBP. 


6 Evaluation 


We implement QEL (Alg. 1) and MBP-QEL (Alg. 4) inside Z3 [19] (version
4.12.0), a state-of-the-art SMT solver. Our implementation (referred to as Z3EG)
is publicly available on GitHub⁵. Z3EG replaces QELITE with QEL, and the
existing MBP with MBP-QEL.

We evaluate Z3EG using two solving tasks. Our first evaluation is on the 
QSAT algorithm [5] for checking satisfiability of formulas with alternating quan- 
tifiers. In QSAT, Z3 uses both QELITE and MBP to under-approximate quan- 
tified formulas. We compare three QSAT implementations: the existing version 
in Z3 with the default QELITE and MBP; the existing version in Z3 in which 
QELITE and MBP are replaced by our egraph-based algorithms, Z3EG; and the 
QSAT implementation in YICESQS⁶, based on the YICES [8] SMT solver. During
the evaluation, we found a bug in the QSAT implementation of Z3 and fixed it⁷.


5 Available at https://github.com/igcontreras/z3/tree/qel-cav23.
6 Available at https://github.com/disteph/yicesQS.
7 Available at https://github.com/igcontreras/z3/commit/133c9e438ce.

Table 1. Instances solved within 20 min by different implementations. Benchmarks
are quantified LIA and LRA formulas from SMT-LIB [2].

Cat.  Count   Z3EG          Z3            YICESQS
              SAT   UNSAT   SAT   UNSAT   SAT   UNSAT
LIA     416   150     266   150     266   107     102
LRA    2419   795    1589   793    1595   808    1610

Table 2. Instances solved within 60 s for our handcrafted benchmarks.

Cat.      Count   Z3EG          Z3
                  SAT   UNSAT   SAT   UNSAT
LIA-ADT     416   150     266   150      56
LRA-ADT    2419   757    1415   196     964


The fix resulted in Z3 solving over 40 sat instances and over 120 unsat instances 
more than before. In the following, we use the fixed version of Z3. 

We use benchmarks in the theory of (quantified) LIA and LRA from SMT- 
LIB [2,3], with alternating quantifiers. LIA and LRA are the only tracks in which 
Z3 uses the QSAT tactic by default. To make our experiments more comprehen- 
sive, we also consider two modified variants of the LIA and LRA benchmarks, 
where we add some non-recursive ADT variables to the benchmarks. Specif- 
ically, we wrap all existentially quantified arithmetic variables using a record 
type ADT and unwrap them whenever they get used⁸. Since these benchmarks
are similar to the original, we force Z3 to use the QSAT tactic on them with a 
tactic.default_tactic=qsat command line option. 

Table 1 summarizes the results for the SMT-LIB benchmarks. In LIA, both 
Z3EG and Z3 solve all benchmarks in under a minute, while YICESQS is unable 
to solve many instances. In LRA, YICESQS solves all instances with very good 
performance. Z3 is able to solve only some benchmarks, and our Z3EG performs 
similarly to Z3. We found that in the LRA benchmarks, the new algorithms in 
Z3EG are not being used since there are not many equalities in the formula, and 
no equalities are inferred during the run of QSAT. Thus, any differences between 
Z3 and Z3EG are due to inherent randomness of the solving process. 

Table 2 summarizes the results for the categories of mixed ADT and arith- 
metic. YICESQS is not able to compete because it does not support ADTs. As 
expected, Z3EG solves many more instances than Z3. 

The second part of our evaluation shows the efficacy of MBP-QEL for Arrays 
and ADTs (Alg. 4) in the context of CHC-solving. Z3 uses both QELITE and 
MBP inside the CHC-solver SPACER [17]. Therefore, we compare Z3 and Z3EG 
on CHC problems containing Arrays and ADTs. We use two sets of benchmarks 
to test out the efficacy of our MBP. The benchmarks in the first set were generated
for verification of Solidity smart contracts [1] (we exclude benchmarks with
non-linear arithmetic, since they are not supported by SPACER). These benchmarks
have a very complex structure that nests ADTs and Arrays. Specifically, they
contain both ADTs of Arrays, as well as Arrays of ADTs. This makes them suitable
to test our MBP-QEL. Row 1 of Table 3 shows the number of instances


8 The modified benchmarks are available at https://github.com/igcontreras/LIA-ADT
and https://github.com/igcontreras/LRA-ADT.

Table 3. Instances solved within 20 min by Z3EG, Z3, and ELDARICA. Benchmarks are
CHCs from Solidity [1] and CHC competition [13]. The abi benchmarks are a subset
of Solidity benchmarks.

Cat.            Count   Z3EG           Z3             ELDARICA
                        SAT    UNSAT   SAT    UNSAT   SAT    UNSAT
Solidity         3468   2324    1133   2314    1114   2329    1134
abi               127     19     108     19      88     19     108
LIA-lin-Arrays    488    214      TR    212      75    147      68


solved by Z3 (SPACER) with and without MBP-QEL. Z3EG solves 29 instances 
more than Z3. Even though MBP is just one part of the overall SPACER algo- 
rithm, we see that for these benchmarks, MBP-QEL makes a significant impact 
on SPACER. Digging deeper, we find that many of these instances come from 
the category called abi (row 2 in Table 3). Z3EG solves all of these benchmarks, 
while Z3 fails to solve 20 of them. We traced the problem down to the MBP 
implementation in Z3: it fails to eliminate all variables, causing a runtime
exception. In contrast, MBP-QEL eliminates all variables successfully, allowing Z3EG
to solve these benchmarks. 
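The two differences quoted above (29 more Solidity instances, 20 abi instances that Z3 misses) follow directly from the SAT and UNSAT columns of Table 3. A short Python check, with the counts copied from the table and hypothetical helper names `total` and `gain`:

```python
# (SAT, UNSAT) counts copied from Table 3
solved = {
    "Solidity": {"Z3EG": (2324, 1133), "Z3": (2314, 1114)},
    "abi": {"Z3EG": (19, 108), "Z3": (19, 88)},
}

def total(category, solver):
    """Total number of instances solved (SAT + UNSAT)."""
    return sum(solved[category][solver])

def gain(category):
    """How many more instances Z3EG solves than Z3 in a category."""
    return total(category, "Z3EG") - total(category, "Z3")
```

Here `gain("Solidity")` evaluates to 29 and `gain("abi")` to 20, and `total("abi", "Z3EG")` equals the full abi count of 127.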

We also compare Z3EG with ELDARICA [14], a state-of-the-art CHC-solver 
that is particularly effective on these benchmarks. Z3EG solves almost as many 
instances as ELDARICA. Furthermore, like Z3, Z3EG is orders of magnitude faster 
than ELDARICA. Finally, we compare the performance of Z3EG on Array bench- 
marks from the CHC competition [13]. Z3EG is competitive with Z3, solving 2 
additional safe instances and almost as many unsafe instances as Z3 (row 3 of 
Table 3). Both Z3EG and Z3 solve quite a few instances more than ELDARICA. 

Our experiments show the effectiveness of our QEL and MBP-QEL in different
settings inside the state-of-the-art SMT solver Z3. While we maintain
performance on quantified arithmetic benchmarks, we improve Z3’s QSAT algo- 
rithm on quantified benchmarks with ADTs. On verification tasks, QEL and 
MBP-QEL help SPACER solve 30 new instances, even though MBP is only a 
relatively small part of the overall SPACER algorithm. 


7 Conclusion 


Quantifier elimination and its under-approximation, Model-Based Projection,
are used by many SMT-based decision procedures, including quantified SAT
and Constrained Horn Clause solving. Traditionally, these are implemented by 
a series of syntactic rules, operating directly on the syntax of an input formula. 
In this paper, we argue that these procedures should be implemented directly 
on the egraph data-structure, already used by most SMT solvers. This results
in algorithms that better handle implicit equality reasoning and in procedures
that are easier to implement and faster. We justify this argument by implementing
quantifier reduction and MBP in Z3 using egraphs and show that the new
implementation translates into significant improvements to the target decision
procedures. Thus, our work provides both theoretical foundations for quantifier
reduction and practical contributions to the Z3 SMT solver.


Abstract. Satisfiability Modulo the Theory of Nonlinear Real Arithmetic,
SMT(NRA) for short, concerns the satisfiability of polynomial
formulas, which are quantifier-free Boolean combinations of polynomial 
equations and inequalities with integer coefficients and real variables. In 
this paper, we propose a local search algorithm for a special subclass of 
SMT(NRA), where all constraints are strict inequalities. An important 
fact is that, given a polynomial formula with n variables, the zero level 
set of the polynomials in the formula decomposes the n-dimensional real 
space into finitely many components (cells) and every polynomial has 
constant sign in each cell. The key point of our algorithm is a new operation
based on real root isolation, called cell-jump, which updates the
current assignment along a given direction such that the assignment can 
‘jump’ from one cell to another. One cell-jump may adjust the values of 
several variables while traditional local search operations, such as flip for 
SAT and critical move for SMT(LIA), only change that of one variable. 
We also design a two-level operation selection to balance the success rate 
and efficiency. Furthermore, our algorithm can be easily generalized to 
a wider subclass of SMT(NRA) where polynomial equations linear with 
respect to some variable are allowed. Experiments show the algorithm is 
competitive with state-of-the-art SMT solvers, and performs particularly 
well on those formulas with high-degree polynomials. 


Keywords: SMT · Local search · Nonlinear real arithmetic ·
Cell-jump · Cylindrical Algebraic Decomposition (CAD)


1 Introduction 


Satisfiability modulo theories (SMT) refers to the problem of determining 
whether a first-order formula is satisfiable with respect to (w.r.t.) certain
theories, such as the theories of linear integer/real arithmetic, nonlinear integer/real
arithmetic and strings. In this paper, we consider the theory of nonlinear real 
arithmetic (NRA) and restrict our attention to the problem of solving satisfia- 
bility of quantifier-free polynomial formulas. 

Solving polynomial constraints has been a central problem in the development
of mathematics. In 1951, Tarski’s decision procedure [33] made it possible
to solve polynomial constraints in an algorithmic way. However, Tarski’s
algorithm is impractical because of its super-exponential complexity. The first
relatively practical method is the cylindrical algebraic decomposition (CAD)
algorithm [13], proposed by Collins in 1975 and followed by many improvements
(see, for example, [6,14,20,22,26]). Unfortunately, those variants do not improve the
complexity of the original algorithm, which is doubly-exponential. On the other 
hand, SMT(NRA) is important in theorem proving and program verification, 
since most complicated programs use real variables and perform nonlinear arithmetic
operations on them. In particular, SMT(NRA) has various applications in
the formal analysis of hybrid systems, dynamical systems and probabilistic sys- 
tems (see the book [12] for reference). 

The most popular approach for solving SMT(NRA) is the lazy approach, 
also known as CDCL(T) [5]. It combines a propositional satisfiability (SAT) 
solver that uses a conflict-driven clause learning (CDCL) style algorithm to find 
assignments of the propositional abstraction of a polynomial formula and a the- 
ory solver that checks the consistency of sets of polynomial constraints. The 
solving effort in the approach is devoted to both the Boolean layer and the the- 
ory layer. For the theory solver, the only complete method is the CAD method, 
and there also exist many efficient but incomplete methods, such as lineari- 
sation [10], interval constraint propagation [34] and virtual substitution [35]. 
Recall that the complexity of the CAD method is doubly-exponential. In order 
to ease the burden of using CAD, an improved CDCL-style search framework, 
the model constructing satisfiability calculus (MCSAT) framework [15,21], was 
proposed. Further, there are many optimizations on CAD projection operation, 
e.g. [7,24,29], custom-made for this framework. Besides, an alternative algo- 
rithm for determining the satisfiability of conjunctions of non-linear polynomial 
constraints over the reals based on CAD is presented in [1]. 

The development of this approach brings us effective SMT(NRA) solvers. 
Almost all state-of-the-art SMT(NRA) solvers are based on the lazy approach, 
including Z3 [28], CVC5 [3], Yices2 [16] and MathSAT5 [11]. These solvers have 
made great progress in solving SMT(NRA). However, their time and memory
usage on some hard instances may be unacceptable, particularly when
the proportion of nonlinear polynomials among all polynomials appearing in the
formula is high. This motivates us to design algorithms which perform well on these
hard instances.

Local search plays an important role in solving satisfiability problems. It is
an incomplete method, since it can only determine satisfiability but not
unsatisfiability. A local search algorithm moves in the space of candidate assignments
(the search space) by applying local changes, until a satisfying assignment is found
or a time bound is reached. It is well known that local search has been
successfully applied to SAT problems [2,4,9,23]. In recent years, there have been
inspiring efforts to develop local search methods for SMT solving. Under
the DPLL(T) framework, Griggio et al. [19] introduced a general procedure for
integrating a local search solver of the WalkSAT family with a theory solver.
Pure local search algorithms [17,30,31] were proposed to solve SMT problems
with respect to the theory of bit-vectors directly on the theory level. Cai et al.
[8] developed a local search procedure for SMT on the theory of linear integer
arithmetic (LIA) through the critical move operation, which works on the
literal level and changes the value of one variable in a false LIA literal to make it
true. We also note that there exists a local search SMT solver for the theory of
NRA, called NRA-LS, which performed well at the SMT Competition 2022¹. A simple
description of the solver without details about local search can be found in [25].

In this paper, we propose a local search algorithm for a special subclass of 
SMT(NRA), where all constraints are strict inequalities. The idea of applying the 
local search method to SMT(NRA) comes from CAD, which is a decomposition 
of the search space $\mathbb{R}^n$ into finitely many cells such that every polynomial in the
formula is sign-invariant on each cell. CAD guarantees that the search space only 
has finitely many states. Similar to the local search method for SAT which moves 
between finitely many Boolean assignments, local search for SMT(NRA) should 
jump between finitely many cells. So, we may use a local search framework for 
SAT to solve SMT(NRA). 

Local search algorithms require an operation to perform local changes. For 
SAT, a standard operation is flip, which modifies the current assignment by 
flipping the value of one Boolean variable from false to true or vice-versa. For 
SMT(NRA), we propose a novel operation, called cell-jump, updating the current
assignment $x_1 \mapsto a_1, \ldots, x_n \mapsto a_n$ ($a_i \in \mathbb{Q}$) to a solution of a false polynomial
constraint ‘$p < 0$’ or ‘$p > 0$’, where $x_i$ is a variable appearing in the given
polynomial formula. Different from the critical move operation for linear integer
constraints [8], it is difficult to determine the threshold value of some variable $x_i$
such that the false polynomial constraint becomes true. We deal with the issue by
the method of real root isolation, which isolates every real root of the univariate
polynomial $p(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n)$ in a sufficiently small open interval
with rational endpoints. If there exists at least one endpoint making the false
constraint true, a cell-jump operation assigns $x_i$ to the one closest to $a_i$. The
procedure can be viewed as searching for a solution along a line parallel to the $x_i$-axis.
In fact, a cell-jump operation can search along any fixed straight line, and thus
one cell-jump may change the values of more than one variable. Each step, the
local search algorithm picks a cell-jump operation to execute according to a two- 
level operation selection and updates the current assignment, until a solution to 
the polynomial formula is found or the terminal condition is satisfied. Moreover, 
our algorithm can be generalized to deal with a wider subclass of SMT(NRA) 
where polynomial equations linear w.r.t. some variable are allowed. 
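The idea of a cell-jump along one axis can be illustrated with a deliberately simplified Python sketch. It assumes that the univariate restriction is a quadratic $a t^2 + b t + c$, so closed-form roots stand in for real root isolation, and it uses a fixed offset `eps` around each root in place of the endpoints of isolating intervals. The names `real_roots_quadratic` and `cell_jump_1d` are hypothetical and do not come from the authors' tool.

```python
import math
from fractions import Fraction

def real_roots_quadratic(a, b, c):
    """Real roots (floats, ascending) of a*t^2 + b*t + c; also handles a == 0."""
    if a == 0:
        return [] if b == 0 else [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:
        return []
    if disc == 0:
        return [-b / (2 * a)]
    r = math.sqrt(disc)
    return sorted([(-b - r) / (2 * a), (-b + r) / (2 * a)])

def cell_jump_1d(coeffs, rel, current, eps=Fraction(1, 1000)):
    """Return a rational value near `current` that makes `p(t) rel 0` true, or None.

    coeffs = (a, b, c) describes the restriction p(t) = a*t^2 + b*t + c.
    Candidates are rational points just left and right of each real root,
    imitating the endpoints of small isolating intervals.
    """
    a, b, c = coeffs

    def p(t):
        return a * t * t + b * t + c

    def holds(value):
        return value < 0 if rel == "<" else value > 0

    candidates = []
    for root in real_roots_quadratic(a, b, c):
        q = Fraction(root).limit_denominator(10**6)
        candidates += [q - eps, q + eps]
    # Each candidate is checked exactly, so a returned value always satisfies the constraint.
    good = [t for t in candidates if holds(p(t))]
    return min(good, key=lambda t: abs(t - current)) if good else None
```

For instance, with $p(t) = 17t^2 + 48t$ and the false constraint $p < 0$ at $t = 1$, the call `cell_jump_1d((17, 48, 0), "<", Fraction(1))` returns `Fraction(-1, 1000)`, the candidate closest to 1 at which $p$ is negative.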

The local search algorithm is implemented with Maple 2022 as a tool. Experiments
are conducted to evaluate the tool on two classes of benchmarks, including
selected instances from SMT-LIB², and some hard instances generated randomly
with only nonlinear constraints. Experimental results show that our tool is com- 
petitive with state-of-the-art SMT solvers on the SMT-LIB benchmarks, and 
performs particularly well on the hard instances. We also combine our tool with 


1 https://smt-comp.github.io/2022.

Z3, CVC5, Yices2 and MathSAT5 respectively to obtain four sequential portfolio
solvers, which show better performance. 

The rest of the paper is organized as follows. The next section introduces some
basic definitions and notation and a general local search framework for solving
a satisfiability problem. Section 3 shows that, from the CAD perspective, the search
space for SMT(NRA) has only finitely many states. In Sect. 4, we describe cell-jump
operations, while in Sect. 5 we provide the scoring function which gives every
operation a score. The main algorithm is presented in Sect. 6. In Sect. 7,
experimental results are provided to indicate the efficiency of the algorithm.
Finally, the paper is concluded in Sect. 8.


2 Preliminaries 


2.1 Notation 


Let $\bar{x} := (x_1, \ldots, x_n)$ be a vector of variables. Denote by $\mathbb{Q}$, $\mathbb{R}$ and $\mathbb{Z}$ the set of
rational numbers, real numbers and integer numbers, respectively. Let $\mathbb{Q}[\bar{x}]$ and
$\mathbb{R}[\bar{x}]$ be the ring of polynomials in the variables $x_1, \ldots, x_n$ with coefficients in $\mathbb{Q}$
and in $\mathbb{R}$, respectively.


Definition 1 (Polynomial Formula). Suppose $A = \{P_1, \ldots, P_m\}$ where every
$P_i$ is a non-empty finite subset of $\mathbb{Q}[\bar{x}]$. The following formula

$$F = \bigwedge_{P_i \in A} \bigvee_{p_{ij} \in P_i} p_{ij}(x_1, \ldots, x_n) \rhd_{ij} 0, \quad \text{where } \rhd_{ij} \in \{<, >, =\},$$

is called a polynomial formula. Additionally, we call $p_{ij}(x_1, \ldots, x_n) \rhd_{ij} 0$ an
atomic polynomial formula, and $\bigvee_{p_{ij} \in P_i} p_{ij}(x_1, \ldots, x_n) \rhd_{ij} 0$ a polynomial
clause.


For any polynomial formula F, poly(F) denotes the set of polynomials
appearing in F. For any atomic formula $\ell$, poly($\ell$) denotes the polynomial
appearing in $\ell$ and rela($\ell$) denotes the relational operator (‘<’, ‘>’ or ‘=’)
of $\ell$.

For any polynomial formula F, an assignment is a mapping $\alpha : \bar{x} \to \mathbb{R}^n$ such
that $\alpha(\bar{x}) = (a_1, \ldots, a_n)$ where $a_i \in \mathbb{R}$. Given an assignment $\alpha$,

- an atomic polynomial formula is true under $\alpha$ if it evaluates to true under $\alpha$,
  and otherwise it is false under $\alpha$,
- a polynomial clause is satisfied under $\alpha$ if at least one atomic formula in the
  clause is true under $\alpha$, and falsified under $\alpha$ otherwise.

When the context is clear, we simply say a true (or false) atomic polynomial
formula and a satisfied (or falsified) polynomial clause. A polynomial formula
is satisfiable if there exists an assignment $\alpha$ such that all clauses in the formula
are satisfied under $\alpha$, and such an assignment is a solution to the polynomial
formula. A polynomial formula is unsatisfiable if no assignment is a solution.


2.2 A General Local Search Framework


When applying local search algorithms to solve a satisfiability problem, the 
search space is the set of all assignments. A general local search framework 
begins with a complete initial assignment. At each step, one of the operations
with the highest score is picked and executed to update the assignment,
until the given terminal condition is reached. Below, we give the formal
definitions of operation and scoring function. 


Definition 2 (Operation). Let F be a formula. Given an assignment $\alpha$ which
is not a solution of F, an operation modifies $\alpha$ to another assignment $\alpha'$.


Definition 3 (Scoring Function). Let F be a formula. Suppose $\alpha$ is the
current assignment and op is an operation. A scoring function is defined as
$\mathrm{score}(op, \alpha) := \mathrm{cost}(\alpha) - \mathrm{cost}(\alpha')$, where the real-valued function cost
measures the cost of making F satisfied under an assignment according to some
heuristic, and $\alpha'$ is the assignment after executing op.


Example 1. In local search algorithms for SAT, a standard operation is flip,
which modifies the current assignment by flipping the value of one Boolean
variable from false to true or vice-versa. A commonly used scoring function measures
the change in the number of falsified clauses by flipping a variable. Thus,
operation op is flip(b) for some Boolean variable b, and cost($\alpha$) is interpreted as
the number of falsified clauses under the assignment $\alpha$.


Actually, only when score(op, $\alpha$) is a positive number does it make sense to
execute operation op, since the operation then guides the current assignment to an
assignment with a lower cost.


Definition 4 (Decreasing Operation). Suppose $\alpha$ is the current assignment.
Given a scoring function score, an operation op is a decreasing operation under
$\alpha$ if score(op, $\alpha$) > 0.


A general local search framework is described in Algorithm 1. The framework 
was used in GSAT [27] for solving SAT problems. Note that if the input formula
F is satisfiable, Algorithm 1 outputs either (i) a solution of F if a solution is
found successfully, or (ii) “unknown” if the algorithm fails.


Algorithm 1. General Local Search Framework

Input : a formula F and a terminal condition φ
Output: a solution to F or unknown

1 initialize assignment α
2 while the terminal condition φ is not satisfied do
3     if α satisfies F then
4         return α
5     else
6         op ← one of the decreasing operations with the highest score
7         perform op to modify α
8 return unknown
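For the SAT instance of this framework (Example 1), a minimal Python sketch could look as follows. It takes flip as the operation and the number of falsified clauses as the cost. The step budget as terminal condition and the random restart when no decreasing operation exists are assumptions of this sketch, since Algorithm 1 leaves both open. The function names are illustrative.

```python
import random

def cost(clauses, assignment):
    """Number of falsified clauses; a literal is +v or -v for a variable v >= 1."""
    return sum(
        not any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )

def local_search(clauses, num_vars, max_steps=1000, seed=0):
    """Greedy local search with flip; returns a solution or None ("unknown")."""
    rng = random.Random(seed)
    alpha = {v: rng.choice([False, True]) for v in range(1, num_vars + 1)}
    for _ in range(max_steps):              # terminal condition: step budget
        current = cost(clauses, alpha)
        if current == 0:
            return alpha
        # score(flip(v), alpha) = cost(alpha) - cost(alpha')
        scores = {}
        for v in alpha:
            alpha[v] = not alpha[v]          # tentatively flip v
            scores[v] = current - cost(clauses, alpha)
            alpha[v] = not alpha[v]          # undo the flip
        best = max(scores.values())
        if best <= 0:
            # No decreasing operation: restart from a random assignment (sketch assumption).
            alpha = {v: rng.choice([False, True]) for v in alpha}
            continue
        v = rng.choice([u for u, s in scores.items() if s == best])
        alpha[v] = not alpha[v]
    return None
```

As in the text, the procedure is incomplete: a `None` result only means that no solution was found within the budget, not that the formula is unsatisfiable.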
3 The Search Space of SMT(NRA)


The search space for SAT problems consists of finitely many assignments. So, 
theoretically speaking, a local search algorithm can eventually find a solution, 
as long as the formula indeed has a solution and there is no cycling during 
the search. It seems intuitive, however, that the search space of an SMT(NRA)
problem, e.g. $\mathbb{R}^n$, is infinite and thus search algorithms may not work.

Fortunately, due to Tarski’s work and the theory of CAD, SMT(NRA) is
decidable. Given a polynomial formula in n variables, by the theory of CAD,
$\mathbb{R}^n$ is decomposed into finitely many cells such that every polynomial in the
formula is sign-invariant on each cell. Therefore, the search space of the problem 
is essentially finite. The cells of SMT(NRA) are very similar to the Boolean 
assignments of SAT, so just like traversing all Boolean assignments in SAT, 
there exists a basic strategy to traverse all cells. 

In this section, we describe the search space of SMT(NRA) based on the 
CAD theory from a local search perspective, providing a theoretical foundation 
for the operators and heuristics we will propose in the next sections. 


Example 2. Consider the polynomial formula 


$$F = (f_1 > 0 \lor f_2 > 0) \land (f_1 < 0 \lor f_2 < 0),$$

where $f_1 = 17x^2 + 2xy + 17y^2 + 48x - 48y$ and $f_2 = 17x^2 - 2xy + 17y^2 - 48x - 48y$.

The solution set of F is shown as the shaded area in Fig. 1. Notice
that poly(F) consists of two polynomials and decomposes $\mathbb{R}^2$ into 10 areas:
$C_1, \ldots, C_{10}$ (see Fig. 2). We refer to these areas as cells.


Fig. 1. The solution set of F in Example 2.

Fig. 2. The zero level set of poly(F) decomposes $\mathbb{R}^2$ into 10 cells.


Definition 5 (Cell). For any finite set $Q \subseteq \mathbb{R}[\bar{x}]$, a cell of Q is a maximally
connected set in $\mathbb{R}^n$ on which the sign of every polynomial in Q is constant. For
any point $\bar{a} \in \mathbb{R}^n$, we denote by cell(Q, $\bar{a}$) the cell of Q containing $\bar{a}$.


By the theory of CAD, we have

Corollary 1. For any finite set $Q \subseteq \mathbb{R}[\bar{x}]$, the number of cells of Q is finite.


It is obvious that any two cells of Q are disjoint and the union of all cells of Q
equals $\mathbb{R}^n$. Definition 5 shows that for a polynomial formula F with poly(F) = Q,
the satisfiability of F is constant on every cell of Q, that is, either all the points
in a cell are solutions to F or none of them are solutions to F.


Example 3. Consider the polynomial formula F in Example 2. As shown in
Fig. 3, assume that we start from point a to search for a solution to F. Jumping 
from a to b makes no difference, as both points are in the same cell and thus 
neither are solutions to F. However, jumping from a to c or from a to d crosses 
different cells and we may discover a cell satisfying F. Herein, the cell containing 
d satisfies F. 


Fig. 3. Jumping from point a to search for a solution of F.

Fig. 4. A cylindrical expansion of a cylindrically complete set containing poly(F).


For the remainder of this section, we will demonstrate how to traverse all 
cells through point jumps between cells. The method of traversing cell by cell in 
a variable by variable direction will be explained step by step from Definition 6 
to Definition 8. 


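Definition 6 (Expansion). Let $Q \subseteq \mathbb{R}[\bar{x}]$ be finite and $\bar{a} = (a_1, \ldots, a_n) \in \mathbb{R}^n$.
Given a variable $x_i$ ($1 \le i \le n$), let $r_1 < \cdots < r_s$ be all real roots of

$$\{ q(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n) \mid q(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n) \neq 0,\ q \in Q \},$$

where $s \in \mathbb{Z}_{\ge 0}$. An expansion of $\bar{a}$ to $x_i$ on Q is a point set $A \subseteq \mathbb{R}^n$ satisfying

(a) $\bar{a} \in A$ and $(a_1, \ldots, a_{i-1}, r_j, a_{i+1}, \ldots, a_n) \in A$ for $1 \le j \le s$,

(b) for any $\bar{b} = (b_1, \ldots, b_n) \in A$, $b_j = a_j$ for $j \in \{1, \ldots, n\} \setminus \{i\}$, and

(c) for any interval $I \in \{(-\infty, r_1), (r_1, r_2), \ldots, (r_{s-1}, r_s), (r_s, +\infty)\}$, there
exists a unique $\bar{b} = (b_1, \ldots, b_n) \in A$ such that $b_i \in I$.


For any point set $\{\bar{a}^{(1)}, \ldots, \bar{a}^{(m)}\} \subseteq \mathbb{R}^n$, an expansion of the set to $x_i$ on Q is
$\bigcup_{j=1}^{m} A_j$, where $A_j$ is an expansion of $\bar{a}^{(j)}$ to $x_i$ on Q.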


Example 4. Consider the polynomial formula F in Example 2. The set of black
solid points in Fig. 3, denoted as A, is an expansion of point (0,0) to x on
poly(F). The set of all points (including black solid points and hollow points)
is an expansion of A to y on poly(F).


As shown in Fig. 3, an expansion of a point to some variable is actually a
result of the point continuously jumping to adjacent cells along that variable
direction. Next, we describe the expansion of all variables in order, which is
a result of jumping from cell to cell along the directions of variables w.r.t. a
variable order.

Definition 7 (Cylindrical Expansion). Let $Q \subseteq \mathbb{R}[\bar{x}]$ be finite and $\bar{a} \in \mathbb{R}^n$.
Given a variable order $x_1 \prec \cdots \prec x_n$, a cylindrical expansion of $\bar{a}$ w.r.t. the
variable order on Q is $\bigcup_{i=1}^{n} A_i$, where $A_1$ is an expansion of $\bar{a}$ to $x_1$ on Q, and
for $2 \le i \le n$, $A_i$ is an expansion of $A_{i-1}$ to $x_i$ on Q. When the context is clear,
we simply call $\bigcup_{i=1}^{n} A_i$ a cylindrical expansion of Q.


Example 5. Consider the formula F in Example 2. It is clear that the set of all
points in Fig. 3 is a cylindrical expansion of point (0,0) w.r.t. $x \prec y$ on poly(F).
The expansion actually describes the following jumping process. First, the origin
(0,0) jumps along the x-axis to the black points, and then those black points
jump along the y-axis direction to the white points.


Clearly, a cylindrical expansion is similar to how a Boolean vector is flipped 
variable by variable. Note that the points in the expansion in Fig. 3 do not cover 
all the cells (e.g. Cz and Cg in Fig.2), but if we start from (0,2), all the cells 
can be covered. This implies that whether all the cells can be covered depends 
on the starting point. 


Definition 8 (Cylindrically Complete). Let $Q \subseteq \mathbb{R}[\bar{x}]$ be finite. Given a
variable order $x_1 \prec \cdots \prec x_n$, Q is said to be cylindrically complete w.r.t. the
variable order, if for any $\bar{a} \in \mathbb{R}^n$ and for any cylindrical expansion A of $\bar{a}$ w.r.t.
the order on Q, every cell of Q contains at least one point in A.


Theorem 1. For any finite set $Q \subseteq \mathbb{R}[\bar{x}]$ and any variable order, there exists
$Q'$ such that $Q \subseteq Q' \subseteq \mathbb{R}[\bar{x}]$ and $Q'$ is cylindrically complete w.r.t. the variable
order.


Proof. Let $Q'$ be the projection set of Q [6,13,26] obtained from the CAD
projection operator w.r.t. the variable order. According to the theory of CAD, $Q'$
is cylindrically complete.


Corollary 2. For any polynomial formula F and any variable order, there exists
a finite set $Q \subseteq \mathbb{R}[\bar{x}]$ such that for any cylindrical expansion A of Q, every cell
of poly(F) contains at least one point in A. Furthermore, F is satisfiable if and
only if F has solutions in A.


Example 6. Consider the polynomial formula F in Example 2. By the proof of
Theorem 1, $Q' := \{x,\ -2 - 3x + x^2,\ -2 + 3x + x^2,\ 10944 + 17x^2,\ f_1,\ f_2\}$ is a
cylindrically complete set w.r.t. $x \prec y$ containing poly(F). As shown in Fig. 4,
the set of all (hollow) points is a cylindrical expansion of point (0,0) w.r.t. $x \prec y$
on $Q'$, which covers all cells of poly(F).


Corollary 2 shows that for a polynomial formula F, there exists a finite
set $Q \subseteq \mathbb{R}[\bar{x}]$ such that we can traverse all the cells of poly(F) through a
search path containing all points in a cylindrical expansion of Q. The cost of
traversing the cells is very high, and in the worst case, the number of cells will
grow exponentially with the number of variables.

The key to building a local search on SMT(NRA) is to construct a heuristic
search based on the operation of jumping between cells.

4 The Cell-Jump Operation


In this section, we propose a novel operation, called cell-jump, that performs local 
changes in our algorithm. The operation is determined by the means of real root 
isolation. We review the method of real root isolation and define sample points 
in Sect. 4.1. Section 4.2 and Sect. 4.3 present a cell-jump operation along a line 
parallel to a coordinate axis and along any fixed straight line, respectively. 


4.1 Sample Points 


Real root isolation is a symbolic way to compute the real roots of a polynomial,
which is of fundamental importance in computational real algebraic geometry
(e.g., it is a routine sub-algorithm for CAD). There are many efficient algorithms
and popular tools in computer algebra systems such as Maple and Mathematica
to isolate the real roots of polynomials.

We first introduce the definition of sequences of isolating intervals for nonzero
univariate polynomials, which can be obtained by any real root isolation tool,
e.g. CLPoly³.


Definition 9 (Sequence of Isolating Intervals). For any nonzero univariate
polynomial $p(x) \in \mathbb{Q}[x]$, a sequence of isolating intervals of p(x) is a sequence
of open intervals $(a_1, b_1), \ldots, (a_s, b_s)$ where $s \in \mathbb{Z}_{\ge 0}$, such that


(i) for each i ($1 \le i \le s$), $a_i, b_i \in \mathbb{Q}$, $a_i < b_i$ and $b_i < a_{i+1}$,
(ii) each interval $(a_i, b_i)$ ($1 \le i \le s$) has exactly one real root of p(x), and
(iii) none of the real roots of p(x) are in $\mathbb{R} \setminus \bigcup_{i=1}^{s} (a_i, b_i)$.


Specially, the sequence of isolating intervals is empty, i.e., s = 0, when p(x) has
no real roots.
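To make Definition 9 concrete, here is a minimal sketch of one classical way to obtain such a sequence: Sturm sequences combined with bisection over exact rationals. It is an illustration only and not the algorithm used by CLPoly or Maple; the function names (`sturm_chain`, `variations`, `isolate`) are ours, and the input is assumed to be a nonconstant polynomial given by its coefficient list, highest degree first, with a nonzero leading coefficient.

```python
from fractions import Fraction

def poly_eval(c, x):
    """Evaluate a polynomial (coefficients highest degree first) by Horner's rule."""
    acc = Fraction(0)
    for a in c:
        acc = acc * x + a
    return acc

def poly_rem(a, b):
    """Remainder of a divided by b (coefficient lists, highest degree first)."""
    a = [Fraction(v) for v in a]
    while len(a) >= len(b):
        q = a[0] / b[0]
        for i in range(len(b)):
            a[i] -= q * b[i]
        a.pop(0)                          # the leading term is now zero
    while a and a[0] == 0:
        a.pop(0)
    return a

def sturm_chain(p):
    """p, p', and the negated remainders of successive divisions."""
    n = len(p) - 1
    deriv = [Fraction(c) * (n - i) for i, c in enumerate(p[:-1])]
    chain = [[Fraction(c) for c in p], deriv]
    while True:
        r = poly_rem(chain[-2], chain[-1])
        if not r:
            return chain
        chain.append([-c for c in r])

def variations(chain, x):
    """Number of sign changes in the chain evaluated at x (zeros are skipped)."""
    signs = [v for v in (poly_eval(q, x) for q in chain) if v != 0]
    return sum(1 for u, v in zip(signs, signs[1:]) if (u > 0) != (v > 0))

def isolate(p):
    """Isolating intervals (a_1, b_1), ..., (a_s, b_s) with b_i < a_{i+1}."""
    chain = sturm_chain(p)
    count = lambda lo, hi: variations(chain, lo) - variations(chain, hi)
    bound = 1 + max(abs(Fraction(c) / p[0]) for c in p[1:])   # Cauchy root bound
    todo, found = [(-bound, bound)], []
    while todo:
        lo, hi = todo.pop()
        k = count(lo, hi)                 # distinct real roots inside (lo, hi)
        if k == 1:
            found.append((lo, hi))
        elif k > 1:
            mid, step = (lo + hi) / 2, (hi - lo) / 4
            while poly_eval(p, mid) == 0:  # never split exactly at a root
                mid, step = mid + step, step / 2
            todo += [(lo, mid), (mid, hi)]
    result = []
    for lo, hi in sorted(found):
        while True:                       # shrink from the right: b_i < a_{i+1}
            mid = (lo + hi) / 2
            if poly_eval(p, mid) == 0:
                lo, hi = (lo + mid) / 2, (mid + hi) / 2
                break
            if count(lo, mid) == 1:
                hi = mid
                break
            lo = mid
        result.append((lo, hi))
    return result

if __name__ == "__main__":
    print(isolate([4, 8, 3]))   # 4t^2 + 8t + 3, real roots -3/2 and -1/2
```

The endpoints produced this way depend on the bisection strategy, so they will generally differ from those reported by other tools; only the three properties of Definition 9 are guaranteed.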


By means of sequences of isolating intervals, we define sample points of uni- 
variate polynomials, which is the key concept of the cell-jump operation proposed 
in Sect. 4.2 and Sect. 4.3. 


Definition 10 (Sample Point). For any nonzero univariate polynomial
$p(x) \in \mathbb{Q}[x]$, let $(a_1, b_1), \ldots, (a_s, b_s)$ be a sequence of isolating intervals of p(x)
where $s \in \mathbb{Z}_{\ge 1}$. Every point in the set $\{a_1, b_s\} \cup \bigcup_{i=1}^{s-1} \{b_i, \frac{b_i + a_{i+1}}{2}, a_{i+1}\}$ is a
sample point of p(x). If $x^*$ is a sample point of p(x) and $p(x^*) > 0$ (or $p(x^*) < 0$),
then $x^*$ is a positive sample point (or negative sample point) of p(x). For the
zero polynomial, it has no sample point, no positive sample point and no
negative sample point.


³ https://github.com/lihaokun/CLPoly.

Remark 1. For any nonzero univariate polynomial p(x) that has real roots, let
$r_1, \ldots, r_s$ ($s \in \mathbb{Z}_{\ge 1}$) be all distinct real roots of p(x). It is obvious that the
sign of p(x) is positive constantly or negative constantly on each interval I of
the set $\{(-\infty, r_1), (r_1, r_2), \ldots, (r_{s-1}, r_s), (r_s, +\infty)\}$. So, we only need to take a
point $x^*$ from the interval I, and then the sign of $p(x^*)$ is the constant sign of
p(x) on I. Specially, we take $a_1$ as the sample point for the interval $(-\infty, r_1)$,
$b_i$, $\frac{b_i + a_{i+1}}{2}$ or $a_{i+1}$ as a sample point for $(r_i, r_{i+1})$ where $1 \le i \le s - 1$, and $b_s$
as the sample point for $(r_s, +\infty)$. By Definition 10, there exists no sample point
for the zero polynomial and a univariate polynomial with no real roots.
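Definition 10 and Remark 1 translate directly into code once a sequence of isolating intervals is available. The sketch below is illustrative: the function names are ours, and the intervals used in the usage example are hand-picked for $4t^2 + 8t + 3$ (roots $-3/2$ and $-1/2$) rather than produced by any particular tool.

```python
from fractions import Fraction

def sample_points(intervals):
    """Sample points of Definition 10 for isolating intervals (a_1,b_1),...,(a_s,b_s)."""
    if not intervals:
        return []                                  # no real roots: no sample points
    pts = [Fraction(intervals[0][0])]              # a_1 lies left of the smallest root
    for (_, b), (a_next, _) in zip(intervals, intervals[1:]):
        b, a_next = Fraction(b), Fraction(a_next)
        pts += [b, (b + a_next) / 2, a_next]       # three points between two roots
    pts.append(Fraction(intervals[-1][1]))         # b_s lies right of the largest root
    return pts

def split_by_sign(coeffs, intervals):
    """Positive and negative sample points of the polynomial with the given
    coefficients (highest degree first)."""
    def p(x):
        acc = Fraction(0)
        for c in coeffs:
            acc = acc * x + c
        return acc
    pts = sample_points(intervals)
    return [x for x in pts if p(x) > 0], [x for x in pts if p(x) < 0]

if __name__ == "__main__":
    # Hand-picked isolating intervals for 4t^2 + 8t + 3 (an assumption of this example).
    ivs = [(Fraction(-2), Fraction(-5, 4)), (Fraction(-3, 4), Fraction(0))]
    positive, negative = split_by_sign([4, 8, 3], ivs)
    print("positive:", positive)
    print("negative:", negative)
```

For these intervals the positive sample points are $-2$ and $0$, and the negative ones are $-5/4$, $-1$ and $-3/4$.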


Example 7. Consider the polynomial $p(x) = x^8 - 4x^6 + 6x^4 - 4x^2 + 1$. It has two
real roots $-1$ and $1$, and a sequence of isolating intervals of it is $(-\frac{215}{128}, -\frac{19}{32})$,
$(\frac{19}{32}, \frac{215}{128})$. Every point in the set $\{-\frac{215}{128}, -\frac{19}{32}, 0, \frac{19}{32}, \frac{215}{128}\}$ is a sample point of p(x).
Note that p(x) > 0 holds on the intervals $(-\infty, -1)$ and $(1, +\infty)$, and p(x) < 0
holds on the interval $(-1, 1)$. Thus, $-\frac{215}{128}$ and $\frac{215}{128}$ are positive sample points of
p(x); $-\frac{19}{32}$, $0$ and $\frac{19}{32}$ are negative sample points of p(x).


4.2 Cell-Jump Along a Line Parallel to a Coordinate Axis 


The critical move operation [8, Definition 2] is a literal-level operation. For any
false LIA literal, the operation changes the value of one variable in it to make
the literal true. In this subsection, we propose a similar operation which adjusts
the value of one variable in a false atomic polynomial formula with ‘<’ or ‘>’.


Definition 11. Suppose the current assignment is $\alpha : x_1 \mapsto a_1, \ldots, x_n \mapsto a_n$
where $a_i \in \mathbb{Q}$. Let $\ell$ be a false atomic polynomial formula under $\alpha$ with a
relational operator ‘<’ or ‘>’.


(i) Suppose $\ell$ is $p(\bar{x}) < 0$. For each variable $x_i$ such that the univariate
polynomial $p(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n)$ has negative sample points, there exists
a cell-jump operation, denoted as cjump($x_i, \ell$), assigning $x_i$ to a negative
sample point closest to $a_i$.

(ii) Suppose $\ell$ is $p(\bar{x}) > 0$. For each variable $x_i$ such that the univariate
polynomial $p(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n)$ has positive sample points, there exists
a cell-jump operation, denoted as cjump($x_i, \ell$), assigning $x_i$ to a positive
sample point closest to $a_i$.


Every assignment in the search space can be viewed as a point in $\mathbb{R}^n$. Then,
performing a cjump($x_i, \ell$) operation is equivalent to moving one step from the
current point $\alpha(\bar{x})$ along the line $(a_1, \ldots, a_{i-1}, \mathbb{R}, a_{i+1}, \ldots, a_n)$. Since the line is
parallel to the $x_i$-axis, we call cjump($x_i, \ell$) a cell-jump along a line parallel to a
coordinate axis.


Theorem 2. Suppose the current assignment is $\alpha : x_1 \mapsto a_1, \ldots, x_n \mapsto a_n$
where $a_i \in \mathbb{Q}$. Let $\ell$ be a false atomic polynomial formula under $\alpha$ with a
relational operator ‘<’ or ‘>’. For every i ($1 \le i \le n$), there exists a solution of
$\ell$ in $\{\alpha' \mid \alpha'(\bar{x}) \in (a_1, \ldots, a_{i-1}, \mathbb{R}, a_{i+1}, \ldots, a_n)\}$ if and only if there exists a
cjump($x_i, \ell$) operation.


Proof. ($\Leftarrow$) It is clear by the definition of negative (or positive) sample points.
($\Rightarrow$) Let $S := \{\alpha' \mid \alpha'(\bar{x}) \in (a_1, \ldots, a_{i-1}, \mathbb{R}, a_{i+1}, \ldots, a_n)\}$. It is equivalent to
proving that if there exists no cjump($x_i, \ell$) operation, then no solution to $\ell$ exists
in S. We only prove it for $\ell$ of the form $p(\bar{x}) < 0$. Recall Definition 10 and Remark
1. There are only three cases in which cjump($x_i, \ell$) does not exist: (1) $p^*$ is the
zero polynomial, (2) $p^*$ has no real roots, (3) $p^*$ has a finite number of real roots,
say $r_1, \ldots, r_s$ ($s \in \mathbb{Z}_{\ge 1}$), and $p^*$ is positive on $\mathbb{R} \setminus \{r_1, \ldots, r_s\}$, where $p^*$ denotes
the polynomial $p(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n)$. In the first case, $p(\alpha'(\bar{x})) = 0$
and in the third case, $p(\alpha'(\bar{x})) \ge 0$ for any assignment $\alpha' \in S$. In the second
case, the sign of $p^*$ is positive constantly or negative constantly on the whole
real axis. Since $\ell$ is false under $\alpha$, we have $p(\alpha(\bar{x})) \ge 0$, that is, $p^*(a_i) \ge 0$. So,
$p^*(x_i) > 0$ for any $x_i \in \mathbb{R}$, which means $p(\alpha'(\bar{x})) > 0$ for any $\alpha' \in S$. Therefore,
no solution to $\ell$ exists in S in the three cases. That completes the proof.


The above theorem shows that if cjump($x_i, \ell$) does not exist, then there is
no need to search for a solution to $\ell$ along the line $(a_1, \ldots, a_{i-1}, \mathbb{R}, a_{i+1}, \ldots, a_n)$.
And we can always obtain a solution to $\ell$ after executing a cjump($x_i, \ell$) operation.


Example 8. Assume the current assignment is $\alpha : x_1 \mapsto 1, x_2 \mapsto 1$. Consider two
false atomic polynomial formulas $\ell_1 : 2x_1^2 + 2x_2^2 - 1 < 0$ and $\ell_2 : x_1^8 x_2^3 - 4x_1^6 +
6x_1^4 x_2 - 4x_1^2 + x_2 > 0$. Let $p_1 := \mathrm{poly}(\ell_1)$ and $p_2 := \mathrm{poly}(\ell_2)$.

We first consider cjump($x_i, \ell_1$). For the variable $x_1$, the corresponding
univariate polynomial is $p_1(x_1, 1) = 2x_1^2 + 1$, and for $x_2$, the corresponding one is
$p_1(1, x_2) = 2x_2^2 + 1$. Both of them have no real roots, and thus there exists no
cjump($x_1, \ell_1$) operation and no cjump($x_2, \ell_1$) operation for $\ell_1$. Applying
Theorem 2, we know a solution of $\ell_1$ can only be located in $\mathbb{R}^2 \setminus ((1, \mathbb{R}) \cup (\mathbb{R}, 1))$ (also see
Fig. 5 (a)). So, we cannot find a solution of $\ell_1$ through one-step cell-jump from
the assignment point (1,1) along the lines $(1, \mathbb{R})$ and $(\mathbb{R}, 1)$.

Then consider cjump($x_i, \ell_2$). For the variable $x_1$, the corresponding univariate
polynomial is $p_2(x_1, 1) = x_1^8 - 4x_1^6 + 6x_1^4 - 4x_1^2 + 1$. Recall Example 7. There are
two positive sample points of $p_2(x_1, 1)$: $-\frac{215}{128}$, $\frac{215}{128}$. And $\frac{215}{128}$ is the closest one
to $\alpha(x_1)$. So, cjump($x_1, \ell_2$) assigns $x_1$ to $\frac{215}{128}$. After executing cjump($x_1, \ell_2$), the
assignment becomes $\alpha' : x_1 \mapsto \frac{215}{128}, x_2 \mapsto 1$ which is a solution of $\ell_2$. For the
variable $x_2$, the corresponding polynomial is $p_2(1, x_2) = x_2^3 + 7x_2 - 8$, which has
one real root 1. A sequence of isolating intervals of $p_2(1, x_2)$ is $(\frac{19}{32}, \frac{215}{128})$, and $\frac{215}{128}$
is the only positive sample point. So, cjump($x_2, \ell_2$) assigns $x_2$ to $\frac{215}{128}$, and then
the assignment becomes $\alpha'' : x_1 \mapsto 1, x_2 \mapsto \frac{215}{128}$ which is another solution of $\ell_2$.


4.3 Cell-Jump Along a Fixed Straight Line 


Given the current assignment $\alpha$ such that $\alpha(\bar{x}) = (a_1, \ldots, a_n) \in \mathbb{Q}^n$, a false
atomic polynomial formula $\ell$ of the form $p(\bar{x}) > 0$ or $p(\bar{x}) < 0$ and a vector
$dir = (d_1, \ldots, d_n) \in \mathbb{Q}^n$, we propose Algorithm 2 to find a cell-jump operation
along the straight line L specified by the point $\alpha(\bar{x})$ and the direction dir,
denoted as cjump($dir, \ell$).

In order to analyze the values of $p(\bar{x})$ on line L, we introduce a new variable
t and replace every $x_i$ in $p(\bar{x})$ with $a_i + d_i t$ to get $p^*(t)$. If rela($\ell$) = ‘<’ and
$p^*(t)$ has negative sample points, there exists a cjump($dir, \ell$) operation. Let $t^*$
be a negative sample point of $p^*(t)$ closest to 0. The assignment becomes $\alpha' :
x_1 \mapsto a_1 + d_1 t^*, \ldots, x_n \mapsto a_n + d_n t^*$ after executing the operation cjump($dir, \ell$).
It is obvious that $\alpha'$ is a solution to $\ell$. If rela($\ell$) = ‘>’ and $p^*(t)$ has positive
sample points, the situation is similar. Otherwise, $\ell$ has no cell-jump operation
along line L.

Similarly, we have:


Theorem 3. Suppose the current assignment is $\alpha : x_1 \mapsto a_1, \ldots, x_n \mapsto a_n$
where $a_i \in \mathbb{Q}$. Let $\ell$ be a false atomic polynomial formula under $\alpha$ with a
relational operator ‘<’ or ‘>’, $dir := (d_1, \ldots, d_n)$ a vector in $\mathbb{Q}^n$ and $L :=
\{(a_1 + d_1 t, \ldots, a_n + d_n t) \mid t \in \mathbb{R}\}$. There exists a solution of $\ell$ in L if and only
if there exists a cjump($dir, \ell$) operation.


Theorem 3 implies that through one-step cell-jump from the point $\alpha(\bar{x})$ along
any line that intersects the solution set of $\ell$, a solution to $\ell$ will be found.


Example 9. Assume the current assignment is $\alpha : x_1 \mapsto 1, x_2 \mapsto 1$. Consider
the false atomic polynomial formula $\ell_1 : 2x_1^2 + 2x_2^2 - 1 < 0$ in Example 8. Let
$p := \mathrm{poly}(\ell_1)$. By Fig. 5 (b), the line (line $L_3$) specified by the point $\alpha(\bar{x})$ and
the direction vector dir = (1,1) intersects the solution set of $\ell_1$. So, there exists
a cjump($dir, \ell_1$) operation by Theorem 3. Notice that the line can be described
in a parametric form, that is $\{(x_1, x_2) \mid x_1 = 1 + t,\ x_2 = 1 + t \text{ where } t \in \mathbb{R}\}$. Then,
analyzing the values of $p(\bar{x})$ on the line is equivalent to analyzing those of $p^*(t)$
on the real axis, where $p^*(t) = p(1 + t, 1 + t) = 4t^2 + 8t + 3$. A sequence of isolating
intervals of p* is (—222,-%), (— 22, £1), and there are two negative sample 


: me, ig 128? ge? S32)” T28 
points: —%7, —35- Since — 35 is the closest one to 0, the operation cjump(dir, 41) 
changes the assignment to a’ : zı +> 33, g + 3, which is a solution of 


$\ell_1$. Again by Fig. 5, there are other lines (the dashed lines) that go through
$\alpha(\bar{x})$ and intersect the solution set. So, we can also find a solution to $\ell_1$ along
these lines. Actually, for any false atomic polynomial formula with ‘<’ or ‘>’
that really has solutions, there always exists some direction dir in $\mathbb{Q}^n$ such that
cjump($dir, \ell$) finds one of them. Therefore, the more directions we try, the greater
the probability of finding a solution of $\ell$.


Algorithm 2. Cell-Jump Along a Fixed Straight Line


Input : $\bar{a} = (a_1, \ldots, a_n)$, the current assignment $x_1 \mapsto a_1, \ldots, x_n \mapsto a_n$ where $a_i \in \mathbb{Q}$
        $\ell$, a false atomic polynomial formula under $\bar{a}$ with a relational operator ‘<’ or ‘>’
        $dir = (d_1, \ldots, d_n)$, a vector in $\mathbb{Q}^n$

Output: $\alpha'$, the assignment after executing a cjump($dir, \ell$) operation, which is a solution to $\ell$;
        FAIL, if there exists no cjump($dir, \ell$) operation


 1 $p \leftarrow \mathrm{poly}(\ell)$
 2 $p^* \leftarrow$ replace every $x_i$ in p with $a_i + d_i t$, where t is a new variable
 3 if rela($\ell$) = ‘<’ and $p^*$ has negative sample points then
 4     $t^* \leftarrow$ a negative sample point of $p^*$ closest to 0
 5     $\alpha' \leftarrow (a_1 + d_1 t^*, \ldots, a_n + d_n t^*)$
 6     return $\alpha'$
 7 if rela($\ell$) = ‘>’ and $p^*$ has positive sample points then
 8     $t^* \leftarrow$ a positive sample point of $p^*$ closest to 0
 9     $\alpha' \leftarrow (a_1 + d_1 t^*, \ldots, a_n + d_n t^*)$
10     return $\alpha'$
11 return FAIL
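The steps of Algorithm 2 can be sketched in a few lines of Python over exact rationals. This is an illustration under stated assumptions, not the authors' Maple implementation: polynomials are represented as dictionaries from exponent tuples to coefficients, the names `restrict_to_line` and `cjump_dir` are ours, and the isolating intervals of $p^*(t)$ are supplied by the caller (for instance from a root isolation tool) instead of being computed here.

```python
from fractions import Fraction

def restrict_to_line(poly, point, direction):
    """Coefficients (lowest degree first) of p*(t) = p(a_1 + d_1 t, ..., a_n + d_n t).

    poly maps exponent tuples (e_1, ..., e_n) to rational coefficients."""
    def mul(u, v):
        out = [Fraction(0)] * (len(u) + len(v) - 1)
        for i, x in enumerate(u):
            for j, y in enumerate(v):
                out[i + j] += x * y
        return out

    total = [Fraction(0)]
    for exps, coeff in poly.items():
        term = [Fraction(coeff)]
        for a, d, e in zip(point, direction, exps):
            for _ in range(e):
                term = mul(term, [Fraction(a), Fraction(d)])   # times (a + d t)
        size = max(len(total), len(term))
        total = [(total[k] if k < len(total) else 0) +
                 (term[k] if k < len(term) else 0) for k in range(size)]
    return total

def cjump_dir(poly, rel, point, direction, intervals):
    """Assignment after cjump(dir, l) for the atom 'poly rel 0', or None for FAIL.

    intervals must be a sequence of isolating intervals of p*(t)."""
    p_star = restrict_to_line(poly, point, direction)
    value = lambda t: sum(c * t**k for k, c in enumerate(p_star))
    if not intervals:
        return None                       # no sample points at all
    pts = [intervals[0][0], intervals[-1][1]]
    for (_, b), (a_next, _) in zip(intervals, intervals[1:]):
        pts += [b, (b + a_next) / 2, a_next]
    wanted = [t for t in pts if (value(t) < 0 if rel == '<' else value(t) > 0)]
    if not wanted:
        return None
    t_star = min(wanted, key=abs)         # sample point closest to 0
    return tuple(a + d * t_star for a, d in zip(point, direction))

if __name__ == "__main__":
    # l_1 : 2*x1^2 + 2*x2^2 - 1 < 0 at (1, 1) along dir = (1, 1), so p*(t) = 4t^2 + 8t + 3.
    p1 = {(2, 0): 2, (0, 2): 2, (0, 0): -1}
    # Hand-picked isolating intervals of p* (an assumption of this example).
    ivs = [(Fraction(-2), Fraction(-5, 4)), (Fraction(-3, 4), Fraction(0))]
    print(cjump_dir(p1, '<', (1, 1), (1, 1), ivs))
```

With these hand-picked intervals the closest negative sample point is $t^* = -3/4$, so the sketch returns the assignment $(1/4, 1/4)$, which satisfies $\ell_1$. A different choice of isolating intervals generally yields a different, equally valid, target point. Passing a unit vector as `direction` gives the axis-parallel operation of Definition 11.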
(a) Neither $L_1$ nor $L_2$ intersects the solution set. (b) Line $L_3$ and the dashed lines
intersect the solution set.


Fig. 5. The figure of the cell-jump operations along the lines $L_1$, $L_2$ and $L_3$ for the
false atomic polynomial formula $\ell_1 : 2x_1^2 + 2x_2^2 - 1 < 0$ under the assignment $\alpha : x_1 \mapsto
1, x_2 \mapsto 1$. The dashed circle denotes the circle $2x_1^2 + 2x_2^2 - 1 = 0$ and the shaded part in
it represents the solution set of the atom. The coordinate of point A is (1,1). Lines $L_1$,
$L_2$ and $L_3$ pass through A and are parallel to the $x_1$-axis, the $x_2$-axis and the vector
(1,1), respectively.


Remark 2. For a false atomic polynomial formula $\ell$ with ‘<’ or ‘>’, cjump($x_i, \ell$)
and cjump($dir, \ell$) make an assignment move to a new assignment, and both
assignments map to an element in $\mathbb{Q}^n$. In fact, we can view cjump($x_i, \ell$) as a
special case of cjump($dir, \ell$) where the i-th component of dir is 1 and all the other
components are 0. The main difference between cjump($x_i, \ell$) and cjump($dir, \ell$) is
that cjump($x_i, \ell$) only changes the value of one variable while cjump($dir, \ell$) may
change the values of many variables. The advantage of cjump($x_i, \ell$) is to avoid
that some atoms can never become true when the values of many variables are
adjusted together. However, performing cjump($dir, \ell$) is more efficient in some
cases, since it may happen that a solution to $\ell$ can be found through one-step
cjump($dir, \ell$), but through many steps of cjump($x_i, \ell$).


5 Scoring Functions 


Scoring functions guide local search algorithms to pick an operation at each step. 
In this section, we introduce a score function which measures the difference of 
the distances to satisfaction under the assignments before and after performing 
an operation. 

First, we define the distance to truth of an atomic polynomial formula. 


Definition 12 (Distance to Truth). Given the current assignment $\alpha$ such
that $\alpha(\bar{x}) = (a_1, \ldots, a_n) \in \mathbb{Q}^n$ and a positive parameter $pp \in \mathbb{Q}_{>0}$, for an atomic
polynomial formula $\ell$ with $p := \mathrm{poly}(\ell)$, its distance to truth is

$$dtt(\ell, \alpha, pp) := \begin{cases} 0, & \text{if } \alpha \text{ is a solution to } \ell, \\ |p(a_1, \ldots, a_n)| + pp, & \text{otherwise.} \end{cases}$$

For an atomic polynomial formula $\ell$, the parameter pp is introduced to
guarantee that the distance to truth of $\ell$ is 0 if and only if the current assignment
$\alpha$ is a solution of $\ell$. Based on the definition of dtt, we use the method of [8,
Definition 3 and 4] to define the distance to satisfaction of a polynomial clause
and the score of an operation, respectively.


Definition 13 (Distance to Satisfaction). Given the current assignment $\alpha$
and a parameter $pp \in \mathbb{Q}_{>0}$, the distance to satisfaction of a polynomial clause c
is $dts(c, \alpha, pp) := \min_{\ell \in c} \{dtt(\ell, \alpha, pp)\}$.


Definition 14 (Score). Given a polynomial formula F, the current assignment
$\alpha$ and a parameter $pp \in \mathbb{Q}_{>0}$, the score of an operation op is defined as

$$score(op, \alpha, pp) := \sum_{c \in F} \big(dts(c, \alpha, pp) - dts(c, \alpha', pp)\big) \cdot w(c),$$

where w(c) denotes the weight of clause c, and $\alpha'$ is the assignment after
performing op.
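Definitions 12 to 14 can be read off almost literally as code. The following sketch is illustrative only: atoms are modelled as pairs of a Python callable and a relational operator ‘<’ or ‘>’ (equality atoms are not covered), a clause is a list of atoms, and a formula is a list of clauses with a parallel list of weights. These representations and the function names are our assumptions.

```python
from fractions import Fraction

def holds(atom, point):
    """Truth value of an atom (p, rel) with rel in {'<', '>'} at the given point."""
    p, rel = atom
    v = p(*point)
    return v < 0 if rel == '<' else v > 0

def dtt(atom, point, pp):
    """Distance to truth (Definition 12)."""
    p, _ = atom
    return 0 if holds(atom, point) else abs(p(*point)) + pp

def dts(clause, point, pp):
    """Distance to satisfaction (Definition 13): minimum dtt over the atoms of the clause."""
    return min(dtt(atom, point, pp) for atom in clause)

def score(formula, weights, before, after, pp):
    """Score (Definition 14) of an operation moving the assignment from before to after."""
    return sum((dts(c, before, pp) - dts(c, after, pp)) * w
               for c, w in zip(formula, weights))

if __name__ == "__main__":
    l1 = (lambda x1, x2: 2*x1*x1 + 2*x2*x2 - 1, '<')
    formula, weights = [[l1]], [1]
    quarter = Fraction(1, 4)
    print(score(formula, weights, (1, 1), (quarter, quarter), pp=1))
```

For the single-clause formula above with pp = 1, the move from (1, 1) to (1/4, 1/4) lowers the distance to satisfaction from 4 to 0, so its score is 4.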


Note that the definition of the score is associated with the weights of clauses. 
In our algorithm, we employ the probabilistic version of the PAWS scheme [9, 
32] to update clause weights. The initial weight of every clause is 1. Given a 
probability sp, the clause weights are updated as follows: with probability 1— sp, 
the weight of every falsified clause is increased by one, and with probability sp, 
for every satisfied clause with weight greater than 1, the weight is decreased by 
one. 
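The weight update described above can be sketched as follows. This is a minimal illustration with our own function name and data layout (a list of weights and a parallel list of flags marking falsified clauses); it is not taken from the authors' implementation.

```python
import random

def paws_update(weights, falsified, sp, rng=random):
    """One probabilistic PAWS step.

    weights:   current clause weights (all initialised to 1 before the search)
    falsified: falsified[k] is True iff clause k is falsified under the assignment
    sp:        smoothing probability
    """
    if rng.random() < sp:
        # With probability sp: decrease satisfied clauses whose weight exceeds 1.
        return [w - 1 if (not f and w > 1) else w for w, f in zip(weights, falsified)]
    # With probability 1 - sp: increase the weight of every falsified clause.
    return [w + 1 if f else w for w, f in zip(weights, falsified)]

if __name__ == "__main__":
    print(paws_update([1, 2, 3], [True, False, True], sp=0.003))
```

Passing an explicit `random.Random(seed)` instance as `rng` makes a run reproducible.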


6 The Main Algorithm 


Based on the proposed cell-jump operation (see Sect. 4) and scoring function (see 
Sect.5), we develop a local search algorithm, called LS Algorithm, for solving 
satisfiability of polynomial formulas in this section. The algorithm is a refined 
extension of the general local search framework as described in Sect. 2.2, where 
we design a two-level operation selection. The section also explains the restart 
mechanism and an optimization strategy used in the algorithm. 

Given a polynomial formula F such that every relational operator appearing
in it is ‘<’ or ‘>’ and an initial assignment that maps to an element in $\mathbb{Q}^n$, LS
Algorithm (Algorithm 3) searches for a solution of F from the initial assignment,
which has the following four steps:


(i) Test whether the current assignment is a solution if the terminal
condition is not reached. If the assignment is a solution, return the
solution. If it is not, go to the next step. The algorithm terminates at once
and returns “unknown” if the terminal condition is satisfied.

(ii) Try to find a decreasing cell-jump operation along a line parallel to
a coordinate axis. We first need to check whether such an operation exists.
That is, to determine whether the set D is empty, where $D = \{\mathrm{cjump}(x_i, \ell) \mid
\ell$ is a false atom, $x_i$ appears in $\ell$ and cjump($x_i, \ell$) is decreasing$\}$. If $D =
\emptyset$, go to the next step. Otherwise, we adopt the two-level heuristic in [8,
Section 4.2]. The heuristic distinguishes a special subset $S \subseteq D$ from the
rest of D, where $S = \{\mathrm{cjump}(x_i, \ell) \in D \mid \ell$ appears in a falsified clause$\}$,
and searches for an operation with the highest score from S. If it fails to find
any operation from S (i.e. $S = \emptyset$), then it searches for one with the highest
score from $D \setminus S$. Perform the found operation and update the assignment.
Go to Step (i).

(iii) Update clause weights according to the PAWS scheme.

(iv) Generate some direction vectors and try to find a decreasing
cell-jump operation along a line parallel to a generated vector. Since
it fails to execute a decreasing cell-jump operation along any line parallel
to a coordinate axis, we generate some new directions and search for a
decreasing cell-jump operation along one of them. The candidate set of such
operations is $\{\mathrm{cjump}(dir, \ell) \mid \ell$ is a false atom, dir is a generated direction
and cjump($dir, \ell$) is decreasing$\}$. If the set is empty, the algorithm returns
“unknown”. Otherwise, we use the two-level heuristic in Step (ii) again to
choose an operation from the set. Perform the chosen operation and update
the assignment. Go to Step (i).


We propose a two-level operation selection in LS Algorithm, which prefers to
choose an operation changing the values of fewer variables. Concretely, only when
there does not exist a decreasing cjump($x_i, \ell$) operation that changes the value of
one variable, do we update clause weights and pick a cjump($dir, \ell$) operation that
may change values of more variables. The strategy makes sense in experiments,
since it is observed that changing too many variables together at the beginning
might make some atoms never become true.

It remains to explain the restart mechanism and an optimization strategy. 


Restart Mechanism. Given any initial assignment, LS Algorithm takes it 
as the starting point of the local search. If the algorithm returns “unknown”, 
we restart LS Algorithm with another initial assignment. A general local search 
framework, like Algorithm 1, searches for a solution from only one starting point. 
However, the restart mechanism allows us to search from more starting points. 
The approach of combining the restart mechanism and a local search procedure 
also aids global search, which finds a solution over the entire search space. 

We set the initial assignments for restarts as follows: All variables are assigned
with 1 for the first time. For the second time, for a variable $x_i$, if there exists a
clause $x_i < ub \lor x_i = ub$ or $x_i > lb \lor x_i = lb$, then $x_i$ is assigned with ub or lb, respectively;
otherwise, $x_i$ is assigned with 1. For the i-th time ($3 \le i \le 7$), every variable
is assigned with 1 or $-1$ randomly. For the i-th time ($i \ge 8$), every variable is
assigned with a random integer between $-50(i - 6)$ and $50(i - 6)$.
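The restart schedule above can be sketched as a small function. The name `initial_assignment`, the run counter `k`, and the `bounds` dictionary (mapping a variable to the ub or lb value extracted from a clause of the stated shape) are hypothetical conveniences of this sketch, and the random ranges are assumed to be inclusive.

```python
import random

def initial_assignment(k, variables, bounds=None, rng=random):
    """Initial assignment for the k-th run of LS (k = 1, 2, 3, ...).

    bounds maps a variable to the ub or lb found in a clause of the form
    x <= ub or x >= lb, if any (how bounds are extracted is not shown here)."""
    bounds = bounds or {}
    if k == 1:
        return {v: 1 for v in variables}
    if k == 2:
        return {v: bounds.get(v, 1) for v in variables}
    if k <= 7:
        return {v: rng.choice((1, -1)) for v in variables}
    r = 50 * (k - 6)
    return {v: rng.randint(-r, r) for v in variables}

if __name__ == "__main__":
    for k in (1, 2, 5, 9):
        print(k, initial_assignment(k, ["x1", "x2", "x3"], bounds={"x2": 7}))
```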


Forbidding Strategies. An inherent problem of the local search method is
cycling, i.e., revisiting assignments. Cycling wastes time and prevents
the search from getting out of local minima. So, we employ a popular forbidding
strategy, called tabu strategy [18], to deal with it. The tabu strategy forbids
reversing the recent changes and can be directly applied in LS Algorithm. Notice
that every cell-jump operation increases or decreases the values of some variables.
After executing an operation that increases/decreases the value of a variable,
the tabu strategy forbids decreasing/increasing the value of the variable in the
subsequent tt iterations, where $tt \in \mathbb{Z}_{\ge 0}$ is a given parameter.
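A minimal sketch of this bookkeeping is given below. The data layout (a dictionary recording, per variable, the direction and iteration of its last change) and the function names are our assumptions and are not taken from the authors' implementation.

```python
def record_move(last_move, var, old, new, step):
    """Remember in which direction var was last changed and at which iteration."""
    last_move[var] = (1 if new > old else -1, step)

def is_tabu(last_move, var, old, new, step, tt):
    """True iff changing var from old to new at iteration step would reverse a
    change made within the previous tt iterations."""
    if var not in last_move or new == old:
        return False
    last_dir, when = last_move[var]
    direction = 1 if new > old else -1
    return direction == -last_dir and step - when <= tt

if __name__ == "__main__":
    moves = {}
    record_move(moves, "x1", old=1, new=3, step=5)       # x1 was increased at step 5
    print(is_tabu(moves, "x1", old=3, new=2, step=6, tt=10))
    print(is_tabu(moves, "x1", old=3, new=4, step=6, tt=10))
```

In the usage example, decreasing x1 one iteration after it was increased is forbidden, whereas increasing it further is allowed.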


Algorithm 3. LS Algorithm


Input : F, a polynomial formula such that the relational operator of every atom is ‘<’ or ‘>’
        $init_\alpha$, an initial assignment that maps to an element in $\mathbb{Q}^n$
Output: a solution (in $\mathbb{Q}^n$) to F or unknown


 1 $\alpha \leftarrow init_\alpha$
 2 while the terminal condition is not reached do
 3     if $\alpha$ satisfies F then return $\alpha$
 4     fal_cl $\leftarrow$ the set of atoms in falsified clauses
 5     sat_cl $\leftarrow$ the set of false atoms in satisfied clauses
 6     if $\exists$ a decreasing cjump($x_i, \ell$) operation where $\ell \in$ fal_cl then
 7         op $\leftarrow$ such an operation with the highest score
 8         $\alpha \leftarrow \alpha$ with op performed
 9     else if $\exists$ a decreasing cjump($x_i, \ell$) operation where $\ell \in$ sat_cl then
10         op $\leftarrow$ such an operation with the highest score
11         $\alpha \leftarrow \alpha$ with op performed
12     else
13         update clause weights according to the PAWS scheme
14         generate a direction vector set dset
15         if $\exists$ a decreasing cjump($dir, \ell$) operation where $dir \in$ dset and $\ell \in$ fal_cl then
16             op $\leftarrow$ such an operation with the highest score
17             $\alpha \leftarrow \alpha$ with op performed
18         else if $\exists$ a decreasing cjump($dir, \ell$) operation where $dir \in$ dset and $\ell \in$ sat_cl then
19             op $\leftarrow$ such an operation with the highest score
20             $\alpha \leftarrow \alpha$ with op performed
21         else
22             return unknown
23 return unknown


Remark 3. If the input formula has equality constraints, then we need to define
a cell-jump operation for a false atom of the form $p(\bar{x}) = 0$. Given the current
assignment $\alpha : x_1 \mapsto a_1, \ldots, x_n \mapsto a_n$ ($a_i \in \mathbb{Q}$), the operation should assign some
variable $x_i$ to a real root of $p(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n)$, which may not be a
rational number. Since it is time-consuming to isolate real roots of a polynomial
with algebraic coefficients, we must guarantee that all assignments are rational
during the search. Thus, we restrict that for every equality constraint $p(\bar{x}) = 0$
in the formula, there exists at least one variable such that the degree of p w.r.t.
the variable is 1. Then, LS Algorithm also works for such a polynomial formula
after some minor modifications: In Line 6 (or Line 9), for every atom $\ell \in$ fal_cl
(or $\ell \in$ sat_cl) and for every variable $x_i$, if $\ell$ has the form $p(\bar{x}) = 0$, p is linear
w.r.t. $x_i$ and $p(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n)$ is not a constant polynomial, there
is a candidate operation that changes the value of $x_i$ to the (rational) solution
of $p(a_1, \ldots, a_{i-1}, x_i, a_{i+1}, \ldots, a_n) = 0$; if $\ell$ has the form $p(\bar{x}) > 0$ or $p(\bar{x}) < 0$, a
candidate operation is cjump($x_i, \ell$). We perform a decreasing candidate operation
with the highest score if such one exists, and update $\alpha$ in Line 8 (or Line 11).
In Line 15 (or Line 18), we only deal with inequality constraints from fal_cl (or
sat_cl), and skip equality constraints.


7 Experiments 


We carried out experiments to evaluate LS Algorithm on two classes of instances, 
where one class consists of selected instances from SMT-LIB while another is 
generated randomly, and compared our tool with state-of-the-art SMT(NRA) 
solvers. Furthermore, we combine our tool with Z3, CVC5, Yices2 and Math- 
SAT5 respectively to obtain four sequential portfolio solvers, which show better 
performance. 


7.1 Experiment Preparation 


Implementation: We implemented LS Algorithm with Maple2022 as a tool,
which is also named LS. There are 3 parameters in LS Algorithm: pp for
computing the score of an operation, tt for the tabu strategy and sp for the PAWS
scheme, which are set as pp = 1, tt = 10 and sp = 0.003. The direction
vectors in LS Algorithm are generated in the following way: Suppose the current
assignment is $x_1 \mapsto a_1, \ldots, x_n \mapsto a_n$ ($a_i \in \mathbb{Q}$) and the polynomial appearing in
the atom to deal with is p. We generate 12 vectors. The first one is the gradient
vector $\left(\frac{\partial p}{\partial x_1}, \ldots, \frac{\partial p}{\partial x_n}\right)\big|_{(a_1, \ldots, a_n)}$. The second one is the vector $(a_1, \ldots, a_n)$. And
the rest are random vectors where every component is a random integer between
$-1000$ and $1000$.
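The direction-generation rule can be sketched in Python as follows. The dictionary representation of polynomials (exponent tuple to coefficient) and the function names are assumptions of this sketch; the actual tool is written in Maple.

```python
import random
from fractions import Fraction

def gradient_at(poly, point):
    """Gradient of a polynomial at a rational point.

    poly maps exponent tuples (e_1, ..., e_n) to rational coefficients."""
    grad = []
    for i in range(len(point)):
        g = Fraction(0)
        for exps, c in poly.items():
            if exps[i] == 0:
                continue                      # the term does not depend on x_i
            term = Fraction(c) * exps[i]
            for j, (a, e) in enumerate(zip(point, exps)):
                term *= Fraction(a) ** (e - 1 if j == i else e)
            g += term
        grad.append(g)
    return grad

def direction_vectors(poly, point, count=12, rng=random):
    """Gradient, current point, then random integer vectors with entries in [-1000, 1000]."""
    dirs = [gradient_at(poly, point), [Fraction(a) for a in point]]
    while len(dirs) < count:
        dirs.append([Fraction(rng.randint(-1000, 1000)) for _ in point])
    return dirs

if __name__ == "__main__":
    p1 = {(2, 0): 2, (0, 2): 2, (0, 0): -1}   # 2*x1^2 + 2*x2^2 - 1
    for d in direction_vectors(p1, (1, 1))[:3]:
        print(d)
```

For $2x_1^2 + 2x_2^2 - 1$ at (1, 1), the first two directions are the gradient (4, 4) and the point (1, 1) itself.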


Experiment Setup: All experiments were conducted on a 16-core Intel Core
i9-12900KF with 128 GB of memory running Arch Linux. We compare
our tool with 4 state-of-the-art SMT(NRA) solvers, namely Z3 (4.11.2), CVC5 
(1.0.3), Yices2 (2.6.4) and MathSAT5 (5.6.5). Each solver is executed with a 
cutoff time of 1200 seconds (as in the SMT Competition) for each instance. We 
also combine LS with every competitor solver as a sequential portfolio solver, 
referred to as “LS+OtherSolver”, where we first run LS with a time limit of 10 
seconds, and if LS fails to solve the instance within that time, we then proceed 
to run OtherSolver from scratch, allotting it the remaining 1190 seconds. 


7.2 Instances 


We prepare two classes of instances. One class consists of in total 2736 unknown 
and satisfiable instances from SMT-LIB(NRA)*, where in every equality poly- 
nomial constraint, the degree of the polynomial w.r.t. each variable is less than 
or equal to 1. 

The rest are random instances. Before introducing the generation approach 
of random instances, we first define some notation. Let rn(down, up) denote a 


* https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks/QF NRA. 
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random integer between two integers down and up, and rp({zx1,..., £n}, d, m) 
denote a random polynomial 57"" , c;M; + co, where c; = rn(—1000, 1000) for 
0 < i < m, M; is a random monomial in {x{! --- x2" | a; E Z>0, ait---t+a, = d} 
and M; (2 < i < m) is a random monomial in {x{? ---x@" | a; € Z>0, a1 +*+ 
an < d}. 

A randomly generated polynomial formula
$rf(\{v\_n_1, v\_n_2\}, \{p\_n_1, p\_n_2\}, \{d_-, d_+\}, \{n_-, n_+\}, \{m_-, m_+\}, \{cl\_n_1, cl\_n_2\}, \{cl\_l_1, cl\_l_2\})$,
where all parameters are in $\mathbb{Z}_{>0}$, is constructed as follows.
First, let $n := rn(v\_n_1, v\_n_2)$ and generate $n$ variables
$x_1, \ldots, x_n$. Second, let $num := rn(p\_n_1, p\_n_2)$ and generate
$num$ polynomials $p_1, \ldots, p_{num}$. Every $p_i$ is a random polynomial
$rp(\{x_{i_1}, \ldots, x_{i_{n_i}}\}, d, m)$, where $n_i = rn(n_-, n_+)$,
$d = rn(d_-, d_+)$, $m = rn(m_-, m_+)$, and
$\{x_{i_1}, \ldots, x_{i_{n_i}}\}$ are $n_i$ variables randomly selected from
$\{x_1, \ldots, x_n\}$. Finally, let $cl\_n := rn(cl\_n_1, cl\_n_2)$ and
generate $cl\_n$ clauses such that the number of atoms in a generated clause is
$rn(cl\_l_1, cl\_l_2)$. The $rn(cl\_l_1, cl\_l_2)$ atoms are randomly picked
from $\{p_i < 0,\ p_i > 0,\ p_i = 0 \mid 1 \le i \le num\}$. If some picked
atom has the form $p_i = 0$ and there exists a variable such that the degree of
$p_i$ w.r.t. the variable is greater than 1, replace the atom with $p_i < 0$ or
$p_i > 0$ with equal probability. In total, we generate 500 random polynomial
formulas according to
$rf(\{30, 40\}, \{60, 80\}, \{20, 30\}, \{10, 20\}, \{20, 30\}, \{40, 60\}, \{3, 5\})$.
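
The generation procedure can be sketched in Python as below. This is a minimal sketch under several assumptions that the text does not settle: monomials are sampled by distributing exponent units at random (not uniformly over the monomial set), atoms are picked with replacement, and the degree of a polynomial w.r.t. a variable is computed syntactically. The data layout (tuples and dictionaries) and the helper names `random_monomial` and `max_var_degree` are illustrative only.

```python
import random

def rn(down, up):
    """Random integer between down and up (both inclusive)."""
    return random.randint(down, up)

def random_monomial(num_vars, degree, exact):
    """Exponent tuple with total degree == degree (exact) or <= degree."""
    total = degree if exact else rn(0, degree)
    exps = [0] * num_vars
    for _ in range(total):
        exps[random.randrange(num_vars)] += 1
    return tuple(exps)

def rp(variables, d, m):
    """Random polynomial (c_0, [(c_i, M_i)]) where M_i maps variable -> exponent."""
    terms = []
    for i in range(1, m + 1):
        exps = random_monomial(len(variables), d, exact=(i == 1))
        terms.append((rn(-1000, 1000), dict(zip(variables, exps))))
    return rn(-1000, 1000), terms

def max_var_degree(poly):
    """Largest exponent of any single variable (syntactic, ignores cancellation)."""
    _, terms = poly
    return max((e for c, mono in terms if c != 0 for e in mono.values()), default=0)

def rf(v_n, p_n, d_rng, n_rng, m_rng, cl_n, cl_l):
    """Random formula: (variables, polynomials, clauses of (poly index, relation))."""
    n = rn(*v_n)
    xs = [f"x{i}" for i in range(1, n + 1)]
    polys = []
    for _ in range(rn(*p_n)):
        n_i = min(rn(*n_rng), n)          # guard for small test parameters
        polys.append(rp(random.sample(xs, n_i), rn(*d_rng), rn(*m_rng)))
    clauses = []
    for _ in range(rn(*cl_n)):
        clause = []
        for _ in range(rn(*cl_l)):
            i = random.randrange(len(polys))
            op = random.choice(["<", ">", "="])
            if op == "=" and max_var_degree(polys[i]) > 1:
                op = random.choice(["<", ">"])   # keep equalities linear per variable
            clause.append((i, op))
        clauses.append(clause)
    return xs, polys, clauses
```

Calling `rf((30, 40), (60, 80), (20, 30), (10, 20), (20, 30), (40, 60), (3, 5))` would mirror the parameter setting quoted above.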

The two classes of instances have different characteristics. The instances 
selected from SMT-LIB(NRA) usually contain lots of linear constraints, and their 
complexity is reflected in the propositional abstraction. For a random instance, 
all the polynomials in it are nonlinear and of high degrees, while its propositional 
abstraction is relatively simple. 


7.3 Experimental Results 


The experimental results are presented in Table 1. The column “#inst” records
the number of instances. Consider first the columns “Z3” through “LS”. On
instances from SMT-LIB(NRA), LS performs worse than all competitors except 
MathSAT5, but it is still comparable. It is crucial to note that our approach is 
much faster than both CVC5 and Z3 on 90% of the Meti-Tarski benchmarks of 
SMT-LIB (2194 instances in total). On random instances, only LS solved all of 
them, while the competitor Z3 with the best performance solved 29% of them. 
The results show that LS is not good at solving polynomial formulas with com- 
plex propositional abstraction and lots of linear constraints, but it has great 
ability to handle those with high-degree polynomials. A possible explanation is 
that as a local search solver, LS cannot exploit the propositional abstraction well 
to find a solution. However, for a formula with plenty of high-degree polynomials, 
cell-jump may ‘jump’ to a solution faster. 

The data revealed in the last column “LS+CVC5” of Table 1 indicates that
the combination of LS and CVC5 manages to solve the majority of the instances
across both classes, suggesting a complementary performance between LS and
top-tier SMT(NRA) solvers. As shown in Table 2, when evaluating combinations
of different solvers with LS, it becomes evident that our method significantly
enhances the capabilities of existing solvers in the portfolio configurations. The
most striking improvement can be witnessed in the “LS+MathSAT5” combination,
which demonstrates superior performance and the most significant enhancement
among all the combination solvers.


Table 1. Results on SMT-LIB(NRA) and random instances.

                  | #inst | Z3   | CVC5 | Yices2 | MathSAT5 | LS   | LS+CVC5
SMT-LIB(NRA)      | 2736  | 2519 | 2563 | 2411   | 1597     | 2246 | 2602
  meti-tarski     | 2194  | 2194 | 2155 | 2173   | 1185     | 2192 | 2193
  20220314-Uncu   | 19    | 19   | 19   | 19     | 16       | 19   | 19
  other           | 523   | 306  | 389  | 219    | 396      | 35   | 390
Random instances  | 500   | 145  | 0    | 22     | 0        | 500  | 500
Total             | 3236  | 2664 | 2563 | 2433   | 1597     | 2746 | 3102


Table 2. Performance Comparison of Different Solver Combinations with LS.

                  | #inst | LS+Z3 | LS+CVC5 | LS+Yices2 | LS+MathSAT5
SMT-LIB(NRA)      | 2736  | 2518  | 2602    | 2432      | 2609
  meti-tarski     | 2194  | 2194  | 2193    | 2194      | 2191
  20220314-Uncu   | 19    | 19    | 19      | 19        | 19
  other           | 523   | 305   | 390     | 219       | 399
Random instances  | 500   | 500   | 500     | 500       | 500
Total             | 3236  | 3018  | 3102    | 2932      | 3109


Besides, Fig.6 shows the performance of LS and the competitors on all 
instances. The horizontal axis represents time, while the vertical axis represents 
the number of solved instances within the corresponding time. Figure 7 presents 
the run time comparisons between LS+CVC5 and CVC5. Every point in the 
figure represents an instance. The horizontal coordinate of the point is the com- 
puting time of LS+CVC5 while the vertical coordinate is the computing time 
of CVC5 (for every instance out of time, we record its computing time as 1200 
seconds). The figure shows that LS+CVC5 improves the performance of CVC5. 
We also present the run time comparisons between LS and each competitor in
Figs. 8-11.

Fig. 6. Number of solved instances.
Fig. 7. Comparing LS+CVC5 with CVC5.
Fig. 10. Comparing LS with MathSAT5.
Fig. 11. Comparing LS with Yices2.


8 Conclusion 


For a given SMT(NRA) formula, although the domain of variables in the for- 
mula is infinite, the satisfiability of the formula can be decided through tests on 
a finite number of samples in the domain. A complete search on such samples 
is inefficient. In this paper, we propose a local search algorithm for a special
class of SMT(NRA) formulas, where every equality polynomial constraint is
linear with respect to at least one variable. The novelty of our algorithm lies
in the cell-jump operation and a two-level operation selection, which guide the
algorithm to jump from one sample to another heuristically. The algorithm has
been applied to two classes of benchmarks and the experimental results show
that it is competitive with state-of-the-art SMT solvers and is good at solving
those formulas with high-degree polynomial constraints. Tests on the solvers
developed by combining this local search algorithm with Z3, CVC5, Yices2 or
MathSAT5 indicate that the algorithm is complementary to these
state-of-the-art SMT(NRA) solvers. In future work, we will improve our
algorithm so that it is able to handle all polynomial formulas.


Abstract. We study partial quantifier elimination (PQE) for propositional
CNF formulas with existential quantifiers. PQE is a generalization
of quantifier elimination where one can limit the set of clauses taken out 
of the scope of quantifiers to a small subset of clauses. The appeal of PQE 
is that many verification problems (e.g., equivalence checking and model 
checking) can be solved in terms of PQE and the latter can be dramati- 
cally simpler than full quantifier elimination. We show that PQE can be 
used for property generation that one can view as a generalization of test- 
ing. The objective here is to produce an unwanted property of a design 
implementation, thus exposing a bug. We introduce two PQE solvers 
called EG-PQE and EG-PQE+. EG-PQE is a very simple SAT-based
algorithm. EG-PQE+ is more sophisticated and robust than EG-PQE.
We use these PQE solvers to find an unwanted property (namely, an 
unwanted invariant) of a buggy FIFO buffer. We also apply them to 
invariant generation for sequential circuits from a HWMCC benchmark 
set. Finally, we use these solvers to generate properties of a combinational 
circuit that mimic symbolic simulation. 


1 Introduction 


In this paper, we consider the following problem. Let $F(X,Y)$ be a propositional
formula in conjunctive normal form (CNF)¹ where $X, Y$ are sets of variables.
Let $G$ be a subset of clauses of $F$. Given a formula $\exists X[F]$, find a
quantifier-free formula $H(Y)$ such that
$\exists X[F] \equiv H \wedge \exists X[F \setminus G]$. In contrast to full
quantifier elimination (QE), only the clauses of $G$ are taken out of the scope
of quantifiers here. So, we call this problem partial QE (PQE) [1]. (In this
paper, we consider PQE only for formulas with existential quantifiers.) We will
refer to $H$ as a solution to PQE. Like SAT, PQE is a way to cope with the
complexity of QE. But in contrast to SAT that is a special case of QE (where all
variables are quantified), PQE generalizes QE. The latter is just a special case
of PQE where $G = F$ and the entire formula is unquantified. Interpolation [2,3]
can be viewed as a special case of PQE as well [4,5].


1 Every formula is a propositional CNF formula unless otherwise stated. Given a CNF
formula $F$ represented as the conjunction of clauses $C_1 \wedge \cdots \wedge C_k$,
we will also consider $F$ as the set of clauses $\{C_1, \ldots, C_k\}$.

The appeal of PQE is threefold. First, it can be much more efficient than
QE if $G$ is a small subset of $F$. Second, many verification problems like SAT,
equivalence checking, model checking can be solved in terms of PQE [1,6-8]. So,
PQE can be used to design new efficient methods for solving known problems.
Third, one can apply PQE to solving new problems like property generation
considered in this paper. In practice, to perform PQE, it suffices to have an
algorithm that takes a single clause out of the scope of quantifiers. Namely, given
a formula $\exists X[F(X,Y)]$ and a clause $C \in F$, this algorithm finds a
formula $H(Y)$ such that
$\exists X[F] \equiv H \wedge \exists X[F \setminus \{C\}]$. To take out $k$
clauses, one can apply this algorithm $k$ times. Since
$H \wedge \exists X[F] \equiv H \wedge \exists X[F \setminus \{C\}]$, solving the
PQE above reduces to finding $H(Y)$ that makes $C$ redundant in
$H \wedge \exists X[F]$. So, the PQE algorithms we present here employ
redundancy based reasoning. We describe two PQE algorithms called EG-PQE and
EG-PQE+ where “EG” stands for “Enumerate and Generalize”. EG-PQE is a very
simple SAT-based algorithm that can sometimes solve very large problems.
EG-PQE+ is a modification of EG-PQE that makes the algorithm more powerful and
robust.

In [7], we showed the viability of an equivalence checker based on PQE. In par- 
ticular, we presented instances for which this equivalence checker outperformed 
ABC [9], a high quality tool. In this paper, we describe and check experimen- 
tally one more important application of PQE called property generation. Our 
motivation here is as follows. Suppose a design implementation Imp meets the
set of specification properties $P_1, \ldots, P_m$. Typically, this set is
incomplete. So, Imp can still be buggy even if every $P_i$, $i = 1, \ldots, m$
holds. Let $P^*_{m+1}, \ldots, P^*_n$ be desired properties adding which makes
the specification complete. If Imp meets the properties $P_1, \ldots, P_m$ but
is still buggy, a missed property $P^*_i$ above fails. That is, Imp has the
unwanted property $\overline{P^*_i}$. So, one can detect bugs by generating
unspecified properties of Imp and checking if there is an unwanted one.

Currently, identification of unwanted properties is mostly done by massive 
testing. (As we show later, the input/output behavior specified by a single test 
can be cast as a simple property of Imp.) Another technique employed in prac- 
tice is guessing unwanted properties that may hold and formally checking if 
this is the case. The problem with these techniques is that they can miss an 
unwanted property. In this paper, we describe property generation by PQE. The 
benefit of PQE is that it can produce much more complex properties than those 
corresponding to single tests. So, using PQE one can detect bugs that testing 
overlooks or cannot find in principle. Importantly, PQE generates properties 
covering different parts of Imp. This makes the search for unwanted properties 
more systematic and facilitates discovering bugs that can be missed if one simply 
guesses unwanted properties that may hold. 

In this paper, we experimentally study generation of invariants of a sequential
circuit $N$. An invariant of $N$ is unwanted if a state that is supposed to be
reachable in $N$ falsifies this invariant and hence is unreachable. Note that
finding a formal proof that $N$ has no unwanted invariants is impractical. (It is
hard to efficiently prove a large set of states reachable because different
states are reached by different execution traces.) So developing practical
methods for finding unwanted invariants is very important. We also study
generation of properties mimicking symbolic simulation for a combinational
circuit obtained by unrolling a sequential circuit. An unwanted property here
exposes a wrong execution trace.

This paper is structured as follows. (Some additional information can be
found in the supporting technical report [5].) In Sect. 2, we give basic
definitions. Section 3 presents property generation for a combinational circuit.
In Sect. 4, we describe invariant generation for a sequential circuit. Sections 5
and 6 present EG-PQE and EG-PQE+ respectively. In Sect. 7, invariant generation
is used to find a bug in a FIFO buffer. Experiments with invariant generation for
HWMCC benchmarks are described in Sect. 8. Section 9 presents an experiment
with property generation for combinational circuits. In Sect. 10 we give some
background. Finally, in Sect. 11, we make conclusions and discuss directions for
future research.


2 Basic Definitions 


In this section, when we say “formula” without mentioning quantifiers, we mean
“a quantifier-free formula”.


Definition 1. We assume that formulas have only Boolean variables. A literal
of a variable $v$ is either $v$ or its negation. A clause is a disjunction of
literals. A formula $F$ is in conjunctive normal form (CNF) if
$F = C_1 \wedge \cdots \wedge C_k$ where $C_1, \ldots, C_k$ are clauses. We will
also view $F$ as the set of clauses $\{C_1, \ldots, C_k\}$. We assume that every
formula is in CNF.


Definition 2. Let $F$ be a formula. Then $Vars(F)$ denotes the set of variables
of $F$ and $Vars(\exists X[F])$ denotes $Vars(F) \setminus X$.


Definition 3. Let $V$ be a set of variables. An assignment $\vec{q}$ to $V$ is
a mapping $V' \rightarrow \{0,1\}$ where $V' \subseteq V$. We will denote the set
of variables assigned in $\vec{q}$ as $Vars(\vec{q})$. We will refer to
$\vec{q}$ as a full assignment to $V$ if $Vars(\vec{q}) = V$. We will denote as
$\vec{q} \subseteq \vec{r}$ the fact that a) $Vars(\vec{q}) \subseteq Vars(\vec{r})$
and b) every variable of $Vars(\vec{q})$ has the same value in $\vec{q}$ and
$\vec{r}$.


Definition 4. A literal, a clause and a formula are said to be satisfied
(respectively falsified) by an assignment $\vec{q}$ if they evaluate to 1
(respectively 0) under $\vec{q}$.


Definition 5. Let $C$ be a clause. Let $H$ be a formula that may have
quantifiers, and $\vec{q}$ be an assignment to $Vars(H)$. If $C$ is satisfied by
$\vec{q}$, then $C_{\vec{q}} \equiv 1$. Otherwise, $C_{\vec{q}}$ is the clause
obtained from $C$ by removing all literals falsified by $\vec{q}$. Denote by
$H_{\vec{q}}$ the formula obtained from $H$ by removing the clauses satisfied by
$\vec{q}$ and replacing every clause $C$ unsatisfied by $\vec{q}$ with
$C_{\vec{q}}$.


Definition 6. Given a formula $\exists X[F(X,Y)]$, a clause $C$ of $F$ is called
a quantified clause if $Vars(C) \cap X \neq \emptyset$. If
$Vars(C) \cap X = \emptyset$, the clause $C$ depends only on free, i.e.,
unquantified variables of $F$ and is called a free clause.


Definition 7. Let $G, H$ be formulas that may have existential quantifiers. We
say that $G, H$ are equivalent, written $G \equiv H$, if
$G_{\vec{q}} = H_{\vec{q}}$ for all full assignments $\vec{q}$ to
$Vars(G) \cup Vars(H)$.


Definition 8. Let $F(X,Y)$ be a formula and $G \subseteq F$ and
$G \neq \emptyset$. The clauses of $G$ are said to be redundant in
$\exists X[F]$ if $\exists X[F] \equiv \exists X[F \setminus G]$. Note that if
$F \setminus G$ implies $G$, the clauses of $G$ are redundant in $\exists X[F]$.


Definition 9. Given a formula $\exists X[F(X,Y)]$ and $G$ where
$G \subseteq F$, the Partial Quantifier Elimination (PQE) problem is to find
$H(Y)$ such that $\exists X[F] \equiv H \wedge \exists X[F \setminus G]$. (So,
PQE takes $G$ out of the scope of quantifiers.) The formula $H$ is called a
solution to PQE. The case of PQE where $G = F$ is called Quantifier Elimination
(QE).


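Example 1. Consider the formula $F = C_1 \wedge C_2 \wedge C_3 \wedge C_4$ where
$C_1 = \overline{x}_3 \vee x_4$, $C_2 = y_1 \vee x_3$,
$C_3 = y_1 \vee \overline{x}_4$, $C_4 = y_2 \vee x_4$. Let $Y$ denote
$\{y_1, y_2\}$ and $X$ denote $\{x_3, x_4\}$. Consider the PQE problem of taking
$C_1$ out of $\exists X[F]$, i.e., finding $H(Y)$ such that
$\exists X[F] \equiv H \wedge \exists X[F \setminus \{C_1\}]$. As we show later,
$\exists X[F] \equiv y_1 \wedge \exists X[F \setminus \{C_1\}]$. That is,
$H = y_1$ is a solution to the PQE problem above.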
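
The claim of Example 1 can be checked by exhaustive enumeration. The sketch below is only a brute-force illustration, not a PQE algorithm: it encodes literals as signed integers (1, 2 for $y_1, y_2$ and 3, 4 for $x_3, x_4$; a negative number is a negated literal), and the helper names `satisfies`, `exists` and `is_pqe_solution` are chosen here for illustration.

```python
from itertools import product

def satisfies(assign, clauses):
    """assign maps variable -> bool; clauses is a list of lists of signed ints."""
    return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)

def exists(X, clauses, y_assign):
    """Evaluate 'exists X [clauses]' under a full assignment to the free variables."""
    for bits in product([False, True], repeat=len(X)):
        if satisfies({**y_assign, **dict(zip(X, bits))}, clauses):
            return True
    return False

def is_pqe_solution(H, F, G, X, Y):
    """Check exists X[F] == H and exists X[F minus G] for every assignment to Y."""
    rest = [c for c in F if c not in G]
    for bits in product([False, True], repeat=len(Y)):
        y = dict(zip(Y, bits))
        if exists(X, F, y) != (satisfies(y, H) and exists(X, rest, y)):
            return False
    return True

Y, X = [1, 2], [3, 4]                       # y1, y2 and x3, x4
C1, C2, C3, C4 = [-3, 4], [1, 3], [1, -4], [2, 4]
F = [C1, C2, C3, C4]
print(is_pqe_solution([[1]], F, [C1], X, Y))   # H = y1
```

Passing an empty `H` (the constant 1) instead fails the check, because in the subspace $y_1 = 0, y_2 = 1$ the formula $F$ is unsatisfiable while $F \setminus \{C_1\}$ is not.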


Remark 1. Let $D$ be a clause of a solution $H$ to the PQE problem of
Definition 9. If $F \setminus G$ implies $D$, then $H \setminus \{D\}$ is a
solution to this PQE problem too.


Proposition 1. Let $H$ be a solution to the PQE problem of Definition 9. That
is, $\exists X[F] \equiv H \wedge \exists X[F \setminus G]$. Then
$F \Rightarrow H$ (i.e., $F$ implies $H$).


The proofs of propositions can be found in [5]. 


Definition 10. Let clauses $C', C''$ have opposite literals of exactly one
variable $w \in Vars(C') \cap Vars(C'')$. Then $C', C''$ are called resolvable on
$w$. Let $C$ be a clause of a formula $G$ and $w \in Vars(C)$. The clause $C$ is
said to be blocked [10] in $G$ with respect to the variable $w$ if no clause of
$G$ is resolvable with $C$ on $w$.


Proposition 2. Let a clause $C$ be blocked in a formula $F(X,Y)$ with respect to
a variable $x \in X$. Then $C$ is redundant in $\exists X[F]$, i.e.,
$\exists X[F \setminus \{C\}] \equiv \exists X[F]$.
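
Definition 10 translates directly into a small check. The following sketch assumes clauses are lists of signed integers (a negative number is a negated literal); the function names and the sample clauses, which are the clauses $C_2, C_3, C_4$ of Example 1, are used only for illustration.

```python
def resolvable_on(c1, c2, w):
    """True if c1 and c2 have opposite literals of exactly one variable, namely w."""
    clashing = {abs(l) for l in c1 if -l in c2}
    return clashing == {w}

def is_blocked(c, formula, w):
    """True if no clause of formula is resolvable with c on variable w."""
    return all(not resolvable_on(c, d, w) for d in formula)

# Variables 1, 2 stand for y1, y2 and 3, 4 for x3, x4.
C2, C3, C4 = [1, 3], [1, -4], [2, 4]
F_rest = [C2, C3, C4]
print(is_blocked(C2, F_rest, 3))               # no clause contains the negation of x3
print(is_blocked(C2, F_rest + [[-3, 4]], 3))   # now C2 can be resolved on x3
```

Note that two clauses clashing on more than one variable are not resolvable in the sense of Definition 10, so `is_blocked([3, 4], [[-3, -4]], 3)` also holds.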


3 Property Generation by PQE 


Many known problems can be formulated in terms of PQE, thus facilitating the 
design of new efficient algorithms. In [5], we give a short summary of results 
on solving SAT, equivalence checking and model checking by PQE presented 
in [1,6-8]. In this section, we describe application of PQE to property generation 
for a combinational circuit. The objective of property generation is to expose a 
bug via producing an unwanted property. 

Let $M(X,V,W)$ be a combinational circuit where $X, V, W$ specify the sets
of the internal, input and output variables of $M$ respectively. Let
$F(X,V,W)$ denote a formula specifying $M$. As usual, this formula is obtained
by Tseitin’s transformations [11]. Namely, $F$ equals
$F_{G_1} \wedge \cdots \wedge F_{G_k}$ where $G_1, \ldots, G_k$ are the gates of
$M$ and $F_{G_i}$ specifies the functionality of gate $G_i$.

Example 2. Let $G$ be a 2-input AND gate defined as $x_3 = x_1 \wedge x_2$ where
$x_3$ denotes the output value and $x_1, x_2$ denote the input values of $G$.
Then $G$ is specified by the formula
$F_G = (\overline{x}_1 \vee \overline{x}_2 \vee x_3) \wedge (x_1 \vee \overline{x}_3) \wedge (x_2 \vee \overline{x}_3)$.
Every clause of $F_G$ is falsified by an inconsistent assignment (where the
output value of $G$ is not implied by its input values). For instance,
$x_1 \vee \overline{x}_3$ is falsified by the inconsistent assignment
$x_1 = 0, x_3 = 1$. So, every assignment satisfying $F_G$ corresponds to a
consistent assignment to $G$ and vice versa. Similarly, every assignment
satisfying the formula $F$ above is a consistent assignment to the gates of $M$
and vice versa.
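
The correspondence claimed in Example 2 can be confirmed by enumerating all eight assignments. In this sketch, which is only an illustration, variables 1, 2, 3 stand for $x_1, x_2, x_3$ and a negative integer denotes a negated literal.

```python
from itertools import product

# Clauses of F_G for the AND gate x3 = x1 AND x2.
F_G = [[-1, -2, 3], [1, -3], [2, -3]]

def satisfies(assign, clauses):
    """assign maps variable -> bool; every clause needs one true literal."""
    return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)

triples = list(product([False, True], repeat=3))
# Assignments (x1, x2, x3) satisfying F_G.
models = {t for t in triples if satisfies(dict(zip([1, 2, 3], t)), F_G)}
# Assignments consistent with the gate, i.e. x3 equals x1 AND x2.
consistent = {(x1, x2, x3) for x1, x2, x3 in triples if x3 == (x1 and x2)}
print(models == consistent)
```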


3.1 High-Level View of Property Generation by PQE 


One generates properties by PQE until an unwanted property exposing a bug 
is produced. (Like in testing, one runs tests until a bug-exposing test is encoun- 
tered.) The benefit of property generation by PQE is fourfold. First, by property 
generation, one can identify bugs that are hard or simply impossible to find by 
testing. Second, using PQE makes property generation efficient. Third, by tak- 
ing out different clauses one can generate properties covering different parts of 
the design. This increases the probability of discovering a bug. Fourth, every 
property generated by PQE specifies a large set of high-quality tests. 

In this paper (Sects. 7, 9), we consider cases where identifying an unwanted 
property is easy. However, in general, such identification is not trivial. A discus- 
sion of this topic is beyond the scope of this paper. (An outline of a procedure 
for deciding if a property is unwanted is given in [5].) 


3.2 Property Generation as Generalization of Testing 


The behavior of $M$ corresponding to a single test can be cast as a property. Let
$w_i \in W$ be an output variable of $M$ and $\vec{v}$ be a test, i.e., a full
assignment to the input variables $V$ of $M$. Let $B^{\vec{v}}$ denote the
longest clause falsified by $\vec{v}$, i.e., $Vars(B^{\vec{v}}) = V$. Let
$l(w_i)$ be the literal satisfied by the value of $w_i$ produced by $M$ under
input $\vec{v}$. Then the clause $B^{\vec{v}} \vee l(w_i)$ is satisfied by every
assignment satisfying $F$, i.e., $B^{\vec{v}} \vee l(w_i)$ is a property of $M$.
We will refer to it as a single-test property (since it describes the behavior
of $M$ for a single test). If the input $\vec{v}$ is supposed to produce the
opposite value of $w_i$ (i.e., the one falsifying $l(w_i)$), then $\vec{v}$
exposes a bug in $M$. In this case, the single-test property above is an
unwanted property of $M$ exposing the same bug as the test $\vec{v}$.
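
As an illustration of a single-test property, consider a hypothetical two-gate circuit (not taken from the paper): an AND gate $x_4 = v_1 \wedge v_2$ followed by an OR gate $w_5 = x_4 \vee v_3$. The sketch below builds $B^{\vec{v}} \vee l(w_5)$ for one test and checks by enumeration that it is implied by the Tseitin formula of the circuit. The encoding (signed integers) and all names are assumptions made for this example.

```python
from itertools import product

V, X, W = [1, 2, 3], [4], [5]              # inputs, internal, output variables
F = [[-1, -2, 4], [1, -4], [2, -4],        # x4 = v1 AND v2
     [-4, 5], [-3, 5], [4, 3, -5]]         # w5 = x4 OR v3

def simulate(test):
    """Output value of the toy circuit under a full input assignment."""
    return (test[1] and test[2]) or test[3]

def single_test_property(test, out_var):
    """Clause B^v OR l(w): B^v is falsified by test, l(w) agrees with the output."""
    b = [-v if test[v] else v for v in V]
    return b + [out_var if simulate(test) else -out_var]

def satisfies(assign, clauses):
    return all(any(assign[abs(l)] == (l > 0) for l in c) for c in clauses)

def is_property(clause):
    """True if every assignment satisfying F also satisfies clause."""
    all_vars = V + X + W
    for bits in product([False, True], repeat=len(all_vars)):
        a = dict(zip(all_vars, bits))
        if satisfies(a, F) and not satisfies(a, [clause]):
            return False
    return True

test = {1: True, 2: True, 3: False}
prop = single_test_property(test, 5)
print(prop, is_property(prop))
```

Flipping the output literal yields a clause that is not a property of the circuit, which is exactly the situation in which the test would expose a bug if the flipped value were the intended one.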

A single-test property can be viewed as a weakest property of $M$ as opposed
to the strongest property specified by $\exists X[F]$. The latter is the truth
table of $M$ that can be computed explicitly by performing QE on
$\exists X[F]$. One can use PQE to generate properties of $M$ that, in terms of
strength, range from the weakest ones to the strongest property inclusively. (By
combining clause splitting with PQE one can generate single-test properties, see
the next subsection.) Consider the PQE problem of taking a clause $C$ out of
$\exists X[F]$. Let $H(V,W)$ be a solution to this problem, i.e.,
$\exists X[F] \equiv H \wedge \exists X[F \setminus \{C\}]$. Since $H$ is implied
by $F$, it can be viewed as a property of $M$. If $H$ is an unwanted property,
$M$ has a bug. (Here we consider the case where a property of $M$ is obtained by
taking a clause out of formula $\exists X[F]$ where only the internal variables
of $M$ are quantified. Later we consider cases where some external variables of
$M$ are quantified too.) We will assume that the property $H$ generated by PQE
has no redundant clauses (see Remark 1). That is, if $D \in H$, then
$F \setminus \{C\} \not\Rightarrow D$. Then one can view $H$ as a property that
holds due to the presence of the clause $C$ in $F$.


3.3 Computing Properties Efficiently 


If a property $H$ is obtained by taking only one clause out of $\exists X[F]$,
its computation is much easier than performing QE on $\exists X[F]$. If
computing $H$ still remains
too time-consuming, one can use the two methods below that achieve better 
performance at the expense of generating weaker properties. The first method 
applies when a PQE solver forms a solution incrementally, clause by clause (like 
the algorithms described in Sects. 5 and 6). Then one can simply stop computing 
H as soon as the number of clauses in H exceeds a threshold. Such a formula H 
is still implied by F and hence specifies a property of M. 

The second method employs clause splitting. Here we consider clause splitting
on input variables $v_1, \ldots, v_p$, i.e., those of $V$ (but one can split a
clause on any subset of variables from $Vars(F)$). Let $F'$ denote the formula
$F$ where a clause $C$ is replaced with $p + 1$ clauses:
$C_1 = C \vee \overline{l(v_1)}, \ldots, C_p = C \vee \overline{l(v_p)}$,
$C_{p+1} = C \vee l(v_1) \vee \cdots \vee l(v_p)$, where $l(v_i)$ is a literal of
$v_i$. The idea is to obtain a property $H$ by taking the clause $C_{p+1}$ out of
$\exists X[F']$ rather than $C$ out of $\exists X[F]$. The former PQE problem is
simpler than the latter since it produces a weaker property $H$. One can show
that if $\{v_1, \ldots, v_p\} = V$, then a) the complexity of PQE reduces to
linear; b) taking out $C_{p+1}$ actually produces a single-test property. The
latter specifies the input/output behavior of $M$ for the test $\vec{v}$
falsifying the literals $l(v_1), \ldots, l(v_p)$. (The details can be found
in [5].)
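
Clause splitting can be sketched as follows, assuming that the first $p$ clauses carry the negated literals $\overline{l(v_i)}$ (this is what makes the conjunction of the $p + 1$ clauses logically equivalent to $C$) and that the splitting variables do not occur in $C$. Clauses are lists of signed integers; the function name `split_clause` is illustrative.

```python
from itertools import product

def split_clause(C, lits):
    """Return C v ~l_1, ..., C v ~l_p, and C v l_1 v ... v l_p."""
    parts = [sorted(set(C) | {-l}, key=abs) for l in lits]
    parts.append(sorted(set(C) | set(lits), key=abs))
    return parts

def value(clause, assign):
    """Truth value of a clause under a full assignment (variable -> bool)."""
    return any(assign[abs(l)] == (l > 0) for l in clause)

C = [4, 5]                       # clause over variables 4 and 5
lits = [1, -2, 3]                # literals l(v1), l(v2), l(v3)
parts = split_clause(C, lits)

# The split preserves equivalence: C holds iff all p + 1 clauses hold.
equivalent = all(
    value(C, a) == all(value(p, a) for p in parts)
    for bits in product([False, True], repeat=5)
    for a in [dict(zip([1, 2, 3, 4, 5], bits))]
)
print(parts, equivalent)
```

The last clause is the only one not satisfied by the input assignment falsifying all of $l(v_1), \ldots, l(v_p)$, which is why taking it out relates to a single test.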


3.4 Using Design Coverage for Generation of Unwanted Properties 


Arguably, testing is so effective in practice because one verifies a particular 
design. Namely, one probes different parts of this design using some coverage 
metric rather than sampling the truth table (which would mean verifying every 
possible design). The same idea works for property generation by PQE for the 
following two reasons. First, by taking out a clause, PQE generates a property 
inherent to the specific circuit M. (If one replaces M with an equivalent but 
structurally different circuit, PQE will generate different properties.) Second, by 
taking out different clauses of F one generates properties corresponding to dif- 
ferent parts of M thus “covering” the design. This increases the chance to take 
out a clause corresponding to the buggy part of M and generate an unwanted 
property. 


3.5 High-Quality Tests Specified by a Property Generated by PQE 


In this subsection, we show that a property H generated by PQE, in general, 
specifies a large set of high-quality tests. Let H(V,W) be obtained by taking C 


116 E. Goldberg 


out of $\exists X[F(X,V,W)]$. Let $Q(V,W)$ be a clause of $H$. As mentioned
above, we assume that $F \setminus \{C\} \not\Rightarrow Q$. Then there is an
assignment $(\vec{x}, \vec{v}, \vec{w})$ satisfying formula
$(F \setminus \{C\}) \wedge \overline{Q}$ where $\vec{x}, \vec{v}, \vec{w}$ are
assignments to $X, V, W$ respectively. (Note that by definition,
$(\vec{v}, \vec{w})$ falsifies $Q$.) Let $(\vec{x}^*, \vec{v}, \vec{w}^*)$ be the
execution trace of $M$ under the input $\vec{v}$. So,
$(\vec{x}^*, \vec{v}, \vec{w}^*)$ satisfies $F$. Note that the output assignments
$\vec{w}$ and $\vec{w}^*$ must be different because $(\vec{v}, \vec{w}^*)$ has to
satisfy $Q$. (Otherwise, $(\vec{x}^*, \vec{v}, \vec{w}^*)$ satisfies
$F \wedge \overline{Q}$ and so $F \not\Rightarrow Q$ and hence
$F \not\Rightarrow H$.) So, one can view $\vec{v}$ as a test “detecting”
disappearance of the clause $C$ from $F$. Note that different assignments
satisfying $(F \setminus \{C\}) \wedge \overline{Q}$ correspond to different
tests $\vec{v}$. So, the clause $Q$ of $H$, in general, specifies a very large
number of tests. One can show that these tests are similar to those detecting
stuck-at faults and so have very high quality [5].


4 Invariant Generation by PQE 


In this section, we extend property generation for combinational circuits to 
sequential ones. Namely, we generate invariants. Note that generation of desired 
auxiliary invariants is routinely used in practice to facilitate verification of a 
predefined property. The problem we consider here is different in that our goal 
is to produce an unwanted invariant exposing a bug. We picked generation of 
invariants (over that of weaker properties just claiming that a state cannot be 
reached in k transitions or less) because identification of an unwanted invariant 
is, arguably, easier. 


4.1 Bugs Making States Unreachable 


Let $N$ be a sequential circuit and $S$ denote the state variables of $N$. Let
$I(S)$ specify the initial state $\vec{s}_{ini}$ (i.e.,
$I(\vec{s}_{ini}) = 1$). Let $T(S', V, S'')$ denote the transition relation of
$N$ where $S', S''$ are the present and next state variables and $V$ specifies
the (combinational) input variables. We will say that a state $\vec{s}$ of $N$
is reachable if there is an execution trace leading to $\vec{s}$. That is, there
is a sequence of states $\vec{s}_0, \ldots, \vec{s}_k$ where
$\vec{s}_0 = \vec{s}_{ini}$, $\vec{s}_k = \vec{s}$ and there exist $\vec{v}_i$,
$i = 0, \ldots, k-1$ for which $T(\vec{s}_i, \vec{v}_i, \vec{s}_{i+1}) = 1$. Let
$N$ have to satisfy a set of invariants $P_0(S), \ldots, P_m(S)$. That is, $P_i$
holds iff $P_i(\vec{s}) = 1$ for every reachable state $\vec{s}$ of $N$. We will
denote the aggregate invariant $P_0 \wedge \cdots \wedge P_m$ as $P_{agg}$. We
will call $\vec{s}$ a bad state of $N$ if $P_{agg}(\vec{s}) = 0$. If $P_{agg}$
holds, no bad state is reachable. We will call $\vec{s}$ a good state of $N$ if
$P_{agg}(\vec{s}) = 1$.

Typically, the set of invariants $P_0, \ldots, P_m$ is incomplete in the sense
that it does not specify all states that must be unreachable. So, a good state
can well be unreachable. We will call a good state operative (or op-state for
short) if it is supposed to be used by $N$ and so should be reachable. We
introduce the term operative state just to factor out “useless” good states. We
will say that $N$ has an op-state reachability bug if an op-state is unreachable
in $N$. In Sect. 7, we consider such a bug in a FIFO buffer. The fact that
$P_{agg}$ holds says nothing about reachability of op-states. Consider, for
instance, a trivial circuit $N_{triv}$ that simply stays in the initial state
$\vec{s}_{ini}$ and $P_{agg}(\vec{s}_{ini}) = 1$. Then $P_{agg}$ holds for
$N_{triv}$ but the latter has op-state reachability bugs (assuming that the
correct circuit must reach states other than $\vec{s}_{ini}$).
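
The notions above can be illustrated on a toy example that is not from the paper: a hypothetical 2-bit counter whose states are the integers 0 to 3, with a bug that makes it wrap around too early. The sketch computes the reachable states by explicit breadth-first search, which is feasible only for tiny state spaces, and reports op-states that are unreachable even though the aggregate invariant holds.

```python
from collections import deque

def reachable(s_ini, inputs, step):
    """States reachable from s_ini, where step(s, v) is the next-state function."""
    seen, queue = {s_ini}, deque([s_ini])
    while queue:
        s = queue.popleft()
        for v in inputs:
            t = step(s, v)
            if t not in seen:
                seen.add(t)
                queue.append(t)
    return seen

op_states = {0, 1, 2, 3}                 # states the counter is supposed to use
p_agg = lambda s: 0 <= s <= 3            # aggregate invariant (toy example)

buggy_step = lambda s, v: (s + v) % 3    # bug: should be modulo 4
trivial_step = lambda s, v: s            # N_triv: stays in the initial state

for name, step in [("buggy", buggy_step), ("trivial", trivial_step)]:
    reach = reachable(0, [0, 1], step)
    holds = all(p_agg(s) for s in reach)
    print(name, "P_agg holds:", holds, "unreachable op-states:", op_states - reach)
```

In both cases the invariant holds on every reachable state, so it gives no hint of the missing op-states.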

Let $R_{\vec{s}}(S)$ be the predicate satisfied only by a state $\vec{s}$. In
terms of CTL, identifying an op-state reachability bug means finding $\vec{s}$
for which the property $EF.R_{\vec{s}}$ must hold but it does not. The reason
for assuming $\vec{s}$ to be unknown is that the set of op-states is typically
too large to explicitly specify every property $EF.R_{\vec{s}}$ to hold. This
makes finding op-state reachability bugs very hard. The problem is exacerbated
by the fact that reachability of different states is established by different
traces. So, in general, one cannot efficiently prove many properties
$EF.R_{\vec{s}}$ (for different states) at once.


4.2 Proving Operative State Unreachability by Invariant 
Generation 


In practice, there are two methods to check reachability of op-states for large 
circuits. The first method is testing. Of course, testing cannot prove a state 
unreachable; however, the examination of execution traces may point to a
potential problem. (For instance, after examining execution traces of the
circuit $N_{triv}$ above one realizes that many op-states look unreachable.) The
other method is to check unwanted invariants, i.e., those that are supposed to
fail. If an unwanted invariant holds for a circuit, the latter has an op-state
reachability bug. For instance, one can check if a state variable $s_i \in S$ of
a circuit never changes its initial value. To break this unwanted invariant, one
needs to find an op-state where the initial value of $s_i$ is flipped. (For the
circuit $N_{triv}$ above this unwanted invariant holds for every state
variable.) The potential unwanted
invariants are formed manually, i.e., simply guessed. 

The two methods above can easily overlook an op-state reachability bug. 
Testing cannot prove that an op-state is unreachable. To correctly guess an 
unwanted invariant that holds, one essentially has to know the underlying bug. 
Below, we describe a method for invariant generation by PQE that is based on 
property generation for combinational circuits. The appeal of this method is 
twofold. First, PQE generates invariants “inherent” to the implementation at 
hand, which drastically reduces the set of invariants to explore. Second, PQE is 
able to generate invariants related to different parts of the circuit (including the 
buggy one). This increases the probability of generating an unwanted invariant. 
We substantiate this intuition in Sect. 7. 

Let formula $F_k$ specify the combinational circuit obtained by unfolding a
sequential circuit $N$ for $k$ time frames and adding the initial state
constraint $I(S_0)$. That is,
$F_k = I(S_0) \wedge T(S_0, V_0, S_1) \wedge \cdots \wedge T(S_{k-1}, V_{k-1}, S_k)$
where $S_j, V_j$ denote the state and input variables of the $j$-th time frame
respectively. Let $H(S_k)$ be a solution to the PQE problem of taking a clause
$C$ out of $\exists X_k[F_k]$ where
$X_k = S_0 \cup V_0 \cup \cdots \cup S_{k-1} \cup V_{k-1}$. That is,
$\exists X_k[F_k] \equiv H \wedge \exists X_k[F_k \setminus \{C\}]$. Note that in
contrast to Sect. 3, here some external variables of the combinational circuit
(namely, the input variables $V_0, \ldots, V_{k-1}$) are quantified too. So, $H$
depends only on state variables of the last time frame. $H$ can be viewed as a
local invariant asserting that no state falsifying $H$ can be reached in $k$
transitions.

One can use H to find global invariants (holding for every time frame) as 
follows. Even if H is only a local invariant, a clause Q of H can be a global 
invariant. The experiments of Sect. 8 show that, in general, this is true for many 
clauses of H. (To find out if Q is a global invariant, one can simply run a model 
checker to see if the property Q holds.) Note that by taking out different clauses 
of Fk one can produce global single-clause invariants Q relating to different parts 
of N. From now on, when we say “an invariant” without a qualifier we mean a 
global invariant. 


5 Introducing EG-PQE 


In this section, we describe a simple SAT-based algorithm for performing PQE
called EG-PQE. Here ‘EG’ stands for ‘Enumerate and Generalize’. EG-PQE
accepts a formula $\exists X[F(X,Y)]$ and a clause $C \in F$. It outputs a
formula $H(Y)$ such that
$\exists X[F_{ini}] \equiv H \wedge \exists X[F_{ini} \setminus \{C\}]$ where
$F_{ini}$ is the initial formula $F$. (This point needs clarification because
EG-PQE changes $F$ by adding clauses.)


5.1 An Example 


Before describing the pseudocode of EG-PQE, we explain how it solves the PQE
problem of Example 1. That is, we consider taking clause $C_1$ out of
$\exists X[F(X,Y)]$ where $F = C_1 \wedge \cdots \wedge C_4$,
$C_1 = \overline{x}_3 \vee x_4$, $C_2 = y_1 \vee x_3$,
$C_3 = y_1 \vee \overline{x}_4$, $C_4 = y_2 \vee x_4$ and $Y = \{y_1, y_2\}$ and
$X = \{x_3, x_4\}$.

EG-PQE iteratively generates a full assignment $\vec{y}$ to $Y$ and checks if
$(C_1)_{\vec{y}}$ is redundant in $\exists X[F_{\vec{y}}]$ (i.e., if $C_1$ is
redundant in $\exists X[F]$ in subspace $\vec{y}$). Note that if
$(F \setminus \{C_1\})_{\vec{y}}$ implies $(C_1)_{\vec{y}}$, then
$(C_1)_{\vec{y}}$ is trivially redundant in $\exists X[F_{\vec{y}}]$. To avoid
such subspaces, EG-PQE generates $\vec{y}$ by searching for an assignment
$(\vec{y}, \vec{x})$ satisfying the formula
$(F \setminus \{C_1\}) \wedge \overline{C_1}$. (Here $\vec{y}$ and $\vec{x}$ are
full assignments to $Y$ and $X$ respectively.) If such $(\vec{y}, \vec{x})$
exists, it satisfies $F \setminus \{C_1\}$ and falsifies $C_1$ thus proving that
$(F \setminus \{C_1\})_{\vec{y}}$ does not imply $(C_1)_{\vec{y}}$.

Assume that EG-PQE found an assignment
$(y_1 = 0, y_2 = 1, x_3 = 1, x_4 = 0)$ satisfying
$(F \setminus \{C_1\}) \wedge \overline{C_1}$. So
$\vec{y} = (y_1 = 0, y_2 = 1)$. Then EG-PQE checks if $F_{\vec{y}}$ is
satisfiable. $F_{\vec{y}} = (\overline{x}_3 \vee x_4) \wedge x_3 \wedge \overline{x}_4$
and so it is unsatisfiable. This means that $(C_1)_{\vec{y}}$ is not redundant
in $\exists X[F_{\vec{y}}]$. (Indeed, $(F \setminus \{C_1\})_{\vec{y}}$ is
satisfiable. So, removing $C_1$ makes $F$ satisfiable in subspace $\vec{y}$.)
EG-PQE makes $(C_1)_{\vec{y}}$ redundant in $\exists X[F_{\vec{y}}]$ by adding
to $F$ a clause $B$ falsified by $\vec{y}$. The clause $B$ equals $y_1$ and is
obtained by identifying the assignments to individual variables of $Y$ that made
$F_{\vec{y}}$ unsatisfiable. (In our case, this is the assignment $y_1 = 0$.)
Note that derivation of clause $y_1$ generalizes the proof of unsatisfiability
of $F$ in subspace $(y_1 = 0, y_2 = 1)$ so that this proof holds for subspace
$(y_1 = 0, y_2 = 0)$ too.

Now EG-PQE looks for a new assignment satisfying $(F \setminus \{C_1\}) \wedge \overline{C_1}$. Let the
assignment $(y_1 = 1, y_2 = 1, x_3 = 1, x_4 = 0)$ be found. So, $\vec{y} = (y_1 = 1, y_2 = 1)$.
Since $(y_1 = 1, y_2 = 1, x_3 = 0)$ satisfies $F$, the formula $F_{\vec{y}}$ is satisfiable. So, $(C_1)_{\vec{y}}$
is already redundant in $\exists X[F_{\vec{y}}]$. To avoid re-visiting the subspace $\vec{y}$, EG-PQE
generates the plugging clause $D = \bar{y}_1 \vee \bar{y}_2$ falsified by $\vec{y}$.

EG-PQE fails to generate a new assignment $\vec{y}$ because the formula
$D \wedge (F \setminus \{C_1\}) \wedge \overline{C_1}$ is unsatisfiable. Indeed, every full assignment $\vec{y}$ we have
examined so far falsifies either the clause $y_1$ added to $F$ or the plugging clause
$D$. The only assignment EG-PQE has not explored yet is $\vec{y} = (y_1 = 1, y_2 = 0)$.
Since $(F \setminus \{C_1\})_{\vec{y}} = x_4$ and $(C_1)_{\vec{y}} = \bar{x}_3 \vee x_4$, the formula $(F \setminus \{C_1\}) \wedge \overline{C_1}$ is
unsatisfiable in subspace $\vec{y}$. In other words, $(C_1)_{\vec{y}}$ is implied by $(F \setminus \{C_1\})_{\vec{y}}$ and
hence is redundant. Thus, $C_1$ is redundant in $\exists X[F_{ini} \wedge y_1]$ for every assignment
to $Y$ where $F_{ini}$ is the initial formula $F$. That is, $\exists X[F_{ini}] \equiv y_1 \wedge \exists X[F_{ini} \setminus \{C_1\}]$
and so the clause $y_1$ is a solution $H$ to our PQE problem.
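Because the example has only two quantified and two free variables, the claimed solution can be checked by exhaustive enumeration. The sketch below is an illustration only: the (variable, polarity) encoding of literals and all helper names are assumptions made here, not part of EG-PQE. It tests whether a candidate H satisfies the defining equivalence of PQE in every subspace of Y.

```python
from itertools import product

# A literal is a (variable, polarity) pair; a clause is a frozenset of literals.
C1 = frozenset({("x3", False), ("x4", True)})   # ~x3 v x4
C2 = frozenset({("y1", True), ("x3", True)})    # y1 v x3
C3 = frozenset({("y1", True), ("x4", False)})   # y1 v ~x4
C4 = frozenset({("y2", True), ("x4", True)})    # y2 v x4
F_INI = [C1, C2, C3, C4]
X_VARS = ["x3", "x4"]
Y_VARS = ["y1", "y2"]


def holds(clauses, assignment):
    """True if every clause has at least one literal made true by the assignment."""
    return all(any(assignment[var] == pol for var, pol in clause)
               for clause in clauses)


def exists_x(clauses, y_assignment):
    """Evaluate 'exists X [clauses]' under a full assignment to Y."""
    return any(
        holds(clauses, {**y_assignment, **dict(zip(X_VARS, x_values))})
        for x_values in product([False, True], repeat=len(X_VARS))
    )


def is_pqe_solution(h_clauses):
    """Check 'exists X [F_ini]' == H and 'exists X [F_ini minus C1]' for all Y."""
    rest = [clause for clause in F_INI if clause != C1]
    for y_values in product([False, True], repeat=len(Y_VARS)):
        y_assignment = dict(zip(Y_VARS, y_values))
        lhs = exists_x(F_INI, y_assignment)
        rhs = holds(h_clauses, y_assignment) and exists_x(rest, y_assignment)
        if lhs != rhs:
            return False
    return True


H = [frozenset({("y1", True)})]   # the single clause y1 derived in the example
print(is_pqe_solution(H))         # the clause y1 is a solution
print(is_pqe_solution([]))        # an empty H is not: C1 cannot simply be dropped
```

The second call fails in the subspace (y1 = 0, y2 = 1), which is exactly the subspace where the example derives the clause y1.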


5.2 Description of EG-PQE 


The pseudo-code of EG-PQE is shown in Fig. 1. EG-PQE starts with storing the
initial formula $F$ and initializing formula $Plg$ that accumulates the plugging
clauses generated by EG-PQE (line 1). As we mentioned in the previous subsection,
plugging clauses are used to avoid re-visiting the subspaces where the formula $F$
is proved satisfiable.

    EG-PQE(F, X, Y, C) {
     1  Plg := ∅; F_ini := F
     2  while (true) {
     3    G := F \ {C}
     4    y := Sat1(Plg ∧ G ∧ ¬C)
     5    if (y = nil)
     6      return(F \ F_ini)
     7    (x*, B) := Sat2(F, y)
     8    if (B ≠ nil) {
     9      F := F ∪ {B}
    10      continue }
    11    D := PlugCls(y, x*, F)
    12    Plg := Plg ∪ {D} } }

    Fig. 1. Pseudocode of EG-PQE

All the work is carried out in a while loop. First, EG-PQE checks if there is a new
subspace $\vec{y}$ where $\exists X[(F \setminus \{C\})_{\vec{y}}]$ does not imply $F_{\vec{y}}$. This is done by searching for an
assignment $(\vec{y}, \vec{x})$ satisfying $Plg \wedge (F \setminus \{C\}) \wedge \overline{C}$ (lines 3-4). If such an assignment
does not exist, the clause $C$ is redundant in $\exists X[F]$. (Indeed, let $\vec{y}$ be a full
assignment to $Y$. The formula $Plg \wedge (F \setminus \{C\}) \wedge \overline{C}$ is unsatisfiable in subspace $\vec{y}$
for one of the two reasons. First, $\vec{y}$ falsifies $Plg$. Then $C_{\vec{y}}$ is redundant because
$F_{\vec{y}}$ is satisfiable. Second, $(F \setminus \{C\})_{\vec{y}} \wedge \overline{C_{\vec{y}}}$ is unsatisfiable. In this case,
$(F \setminus \{C\})_{\vec{y}}$ implies $C_{\vec{y}}$.) Then EG-PQE returns the set of clauses added to the
initial formula $F$ as a solution $H$ to the PQE problem (lines 5-6).

If the satisfying assignment $(\vec{y}, \vec{x})$ above exists, EG-PQE checks if the formula
$F_{\vec{y}}$ is satisfiable (line 7). If not, then the clause $C_{\vec{y}}$ is not redundant in
$\exists X[F_{\vec{y}}]$ (because $(F \setminus \{C\})_{\vec{y}}$ is satisfiable). So, EG-PQE makes $C_{\vec{y}}$ redundant
by generating a clause $B(Y)$ falsified by $\vec{y}$ and adding it to $F$ (line 9). Note
that adding $B$ also prevents EG-PQE from re-visiting the subspace $\vec{y}$.
The clause $B$ is built by finding an unsatisfiable subset of $F_{\vec{y}}$ and collecting the
literals of $Y$ removed from clauses of this subset when obtaining $F_{\vec{y}}$ from $F$.

If $F_{\vec{y}}$ is satisfiable, EG-PQE generates an assignment $\vec{x}^*$ to $X$ such that
$(\vec{y}, \vec{x}^*)$ satisfies $F$ (line 7). The satisfiability of $F_{\vec{y}}$ means that every clause
of $F_{\vec{y}}$ including $C_{\vec{y}}$ is redundant in $\exists X[F_{\vec{y}}]$. At this point, EG-PQE uses the
longest clause $D(Y)$ falsified by $\vec{y}$ as a plugging clause (line 11). The clause $D$ is
added to $Plg$ to avoid re-visiting subspace $\vec{y}$. Sometimes it is possible to remove
variables from $\vec{y}$ to produce a shorter assignment $\vec{y}^*$ such that $(\vec{y}^*, \vec{x}^*)$ still
satisfies $F$. Then one can use a shorter plugging clause $D$ that is falsified by $\vec{y}^*$
and involves only the variables assigned in $\vec{y}^*$.
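The loop described in this subsection can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: clauses are sets of DIMACS-style integer literals, the SAT calls are replaced by a hypothetical brute-force helper find_model, and the clause B is generalized by greedily dropping assignments to Y (a simple stand-in for extracting an unsatisfiable subset of the formula). The sketch also assumes that C contains at least one variable of X.

```python
from itertools import product


def satisfied(clause, assign):
    """True if some literal of the clause is true under the assignment."""
    return any(assign.get(abs(lit)) == (lit > 0) for lit in clause)


def find_model(clauses, variables, fixed):
    """Brute-force SAT check: extend `fixed` over `variables`; model or None."""
    free = [v for v in variables if v not in fixed]
    for bits in product([False, True], repeat=len(free)):
        assign = {**fixed, **dict(zip(free, bits))}
        if all(satisfied(c, assign) for c in clauses):
            return assign
    return None


def eg_pqe(formula, X, Y, C):
    """Take clause C out of 'exists X [formula]'; return the added clauses H."""
    F = [frozenset(c) for c in formula]
    C = frozenset(C)
    assert any(abs(lit) in X for lit in C), "sketch assumes C depends on X"
    F_ini = list(F)
    plg = []                                     # plugging clauses
    neg_C = [frozenset([-lit]) for lit in C]     # negation of C as unit clauses
    while True:
        G = [c for c in F if c != C]
        model = find_model(plg + G + neg_C, X + Y, {})        # lines 3-4
        if model is None:
            return [c for c in F if c not in F_ini]           # lines 5-6
        y = {v: model[v] for v in Y}
        if find_model(F, X, y) is None:                       # line 7: F_y unsat
            core = dict(y)
            for v in Y:   # greedy generalization (assumption of this sketch)
                trial = {u: b for u, b in core.items() if u != v}
                if find_model(F, X + Y, trial) is None:
                    core = trial
            B = frozenset(-v if b else v for v, b in core.items())
            F.append(B)                                       # line 9
            continue
        D = frozenset(-v if b else v for v, b in y.items())   # line 11
        plg.append(D)                                         # line 12


# Example of Subsect. 5.1 with y1, y2, x3, x4 encoded as 1, 2, 3, 4.
EXAMPLE_F = [[-3, 4], [1, 3], [1, -4], [2, 4]]
EXAMPLE_H = eg_pqe(EXAMPLE_F, X=[3, 4], Y=[1, 2], C=[-3, 4])
print(EXAMPLE_H)
```

On the example of Subsect. 5.1 the sketch goes through the same steps as the text: it adds the clause y1, then the plugging clause over y1 and y2, and then stops. Enumeration is exponential, so a real implementation would use an incremental SAT solver instead of find_model.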


5.3 Discussion 


EG-PQE is similar to the QE algorithm presented at CAV-2002 [12]. We will
refer to it as CAV02-QE. Given a formula $\exists X[F(X,Y)]$, CAV02-QE enumerates
full assignments to $Y$. In subspace $\vec{y}$, if $F_{\vec{y}}$ is unsatisfiable, CAV02-QE adds
to $F$ a clause falsified by $\vec{y}$. Otherwise, CAV02-QE generates a plugging clause
$D$. (In [12], $D$ is called “a blocking clause”. This term can be confused with the
term “blocked clause” specifying a completely different kind of a clause. So, we
use the term “the plugging clause” instead.) To apply the idea of CAV02-QE
to PQE, we reformulated it in terms of redundancy based reasoning.

The main flaw of EG-PQE inherited from CAV02-QE is the necessity to
use plugging clauses produced from a satisfying assignment. Consider the PQE
problem of taking a clause $C$ out of $\exists X[F(X,Y)]$. If $F$ is proved unsatisfiable in
subspace $\vec{y}$, typically, only a small subset of clauses of $F_{\vec{y}}$ is involved in the proof.
Then the clause generated by EG-PQE is short and thus proves $C$ redundant
in many subspaces different from $\vec{y}$. On the contrary, to prove $F$ satisfiable
in subspace $\vec{y}$, every clause of $F$ must be satisfied. So, the plugging clause
built off a satisfying assignment includes almost every variable of $Y$. Despite
this flaw of EG-PQE, we present it for two reasons. First, it is a very simple
SAT-based algorithm that can be easily implemented. Second, EG-PQE has
a powerful advantage over CAV02-QE since it solves PQE rather than QE.
Namely, EG-PQE does not need to examine the subspaces $\vec{y}$ where $C$ is implied
by $F \setminus \{C\}$. Surprisingly, for many formulas this allows EG-PQE to completely
avoid examining subspaces where $F$ is satisfiable. In this case, EG-PQE is very
efficient and can solve very large problems. Note that when CAV02-QE performs
complete QE on $\exists X[F]$, it cannot avoid subspaces $\vec{y}$ where $F_{\vec{y}}$ is satisfiable
unless $F$ itself is unsatisfiable (which is very rare in practical applications).


6 Introducing EG-PQE+


In this section, we describe EG-PQE+, an improved version of EG-PQE.


6.1 Main Idea 


The pseudocode of EG-PQE+ is shown in Fig. 2. It is different from that of
EG-PQE only in line 11 marked with an asterisk. The motivation for this change
is as follows. Line 11 describes proving redundancy of $C$ for the case where $C_{\vec{y}}$
is not implied by $(F \setminus \{C\})_{\vec{y}}$ and $F_{\vec{y}}$ is satisfiable. Then EG-PQE simply uses a
satisfying assignment as a proof of redundancy of $C$ in subspace $\vec{y}$. This proof
is unnecessarily strong because it proves that every clause of $F$ (including $C$) is
redundant in $\exists X[F]$ in subspace $\vec{y}$. Such a strong proof is hard to generalize to
other subspaces.

The idea of EG-PQE+ is to generate a proof for a much weaker proposition, namely
a proof of redundancy of $C$ (and only $C$). Intuitively, such a proof should be
easier to generalize. So, EG-PQE+ calls a procedure PrvClsRed generating such a
proof. EG-PQE+ is a generic algorithm in the sense that any suitable procedure
can be employed as PrvClsRed. In our current implementation, the procedure
DS-PQE [1] is used as PrvClsRed. DS-PQE generates a proof stating that $C$ is
redundant in $\exists X[F]$ in subspace $\vec{y}^* \subseteq \vec{y}$. Then the plugging clause $D$ falsified
by $\vec{y}^*$ is generated. Importantly, $\vec{y}^*$ can be much shorter than $\vec{y}$. (A brief
description of DS-PQE in the context of EG-PQE+ is given in [5].)

    EG-PQE+(F, X, Y, C) {
     1   Plg := ∅; F_ini := F
     2   while (true) {
         ......
    11*  D := PrvClsRed(y, F, C)
    12   Plg := Plg ∪ {D} } }

    Fig. 2. Pseudocode of EG-PQE+


Example 3. Consider the example solved in Subsect. 5.1. That is, we consider
taking clause $C_1$ out of $\exists X[F(X,Y)]$ where $F = C_1 \wedge \dots \wedge C_4$, $C_1 = \bar{x}_3 \vee x_4$,
$C_2 = y_1 \vee x_3$, $C_3 = y_1 \vee \bar{x}_4$, $C_4 = y_2 \vee x_4$ and $Y = \{y_1, y_2\}$ and $X = \{x_3, x_4\}$.
Consider the step where EG-PQE proves redundancy of $C_1$ in subspace $\vec{y} =
(y_1 = 1, y_2 = 1)$. EG-PQE shows that $(y_1 = 1, y_2 = 1, x_3 = 0)$ satisfies $F$, thus
proving every clause of $F$ (including $C_1$) redundant in $\exists X[F]$ in subspace $\vec{y}$.
Then EG-PQE generates the plugging clause $D = \bar{y}_1 \vee \bar{y}_2$ falsified by $\vec{y}$.

In contrast to EG-PQE, EG-PQE+ calls PrvClsRed to produce a proof of
redundancy for the clause $C_1$ alone. Note that $F$ has no clauses resolvable with
$C_1$ on $x_3$ in subspace $\vec{y}^* = (y_1 = 1)$. (The clause $C_2$ containing $x_3$ is satisfied by
$\vec{y}^*$.) This means that $C_1$ is blocked in subspace $\vec{y}^*$ and hence redundant there
(see Proposition 2). Since $\vec{y}^* \subseteq \vec{y}$, EG-PQE+ produces a more general proof of
redundancy than EG-PQE. To avoid re-examining the subspace $\vec{y}^*$, EG-PQE+
generates a shorter plugging clause $D = \bar{y}_1$.
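The blocked-clause argument of Example 3 can be replayed mechanically. The sketch below assumes the usual reading of "resolvable": two clauses are resolvable on a variable if they contain opposite literals of that variable and of no other unassigned variable. The integer encoding (y1, y2, x3, x4 as 1, 2, 3, 4) and the function names are illustrative choices, not the interface of DS-PQE or PrvClsRed.

```python
def resolvable_on(c, d, var, subspace):
    """True if c and d clash on `var` and on no other unassigned variable."""
    clashing = {abs(lit) for lit in c
                if -lit in d and abs(lit) not in subspace}
    return clashing == {var}


def is_blocked(formula, C, var, subspace):
    """True if no clause left unsatisfied by `subspace` resolves with C on var."""
    def satisfied(clause):
        return any(subspace.get(abs(lit)) == (lit > 0) for lit in clause)

    return not any(
        resolvable_on(C, d, var, subspace)
        for d in formula
        if d != C and not satisfied(d)
    )


C1 = frozenset({-3, 4})        # ~x3 v x4
F = [C1, frozenset({1, 3}), frozenset({1, -4}), frozenset({2, 4})]

print(is_blocked(F, C1, 3, {1: True}))   # subspace y1 = 1: C2 is satisfied
print(is_blocked(F, C1, 3, {}))          # no assignment: C2 resolves with C1
```

In the subspace y1 = 1 the only clause with the literal x3 is already satisfied, so C1 is blocked on x3 regardless of the value of y2. This is why the shorter plugging clause suffices.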


6.2 Discussion 


Consider the PQE problem of taking a clause $C$ out of $\exists X[F(X,Y)]$. There are
two features of PQE that make it easier than QE. The first feature mentioned
earlier is that one can ignore the subspaces $\vec{y}$ where $F \setminus \{C\}$ implies $C$. The
second feature is that when $F_{\vec{y}}$ is satisfiable, one only needs to prove redundancy of
the clause $C$ alone. Among the three algorithms we ran in experiments, namely,
DS-PQE, EG-PQE and EG-PQE+, only the last exploits both features. (In
addition to using DS-PQE inside EG-PQE+ we also ran it as a stand-alone
PQE solver.) DS-PQE does not use the first feature [1] and EG-PQE does not
exploit the second one. As we show in Sects. 7 and 8, this affects the performance
of DS-PQE and EG-PQE.


7 Experiment with FIFO Buffers


In this and the next two sections we describe some experiments with
DS-PQE, EG-PQE and EG-PQE+ (their sources are available at [13], [14]
and [15] respectively). We used Minisat2.0 [16] as an internal SAT-solver. The
experiments were run on a computer with a 1.6 GHz Intel Core i5-8265U CPU.

In this section, we give an example of bug detection by invariant generation for
a FIFO buffer. Our objective here is threefold. First, we want to give an example
of a bug that can be overlooked by testing and guessing the unwanted properties
to check (see Subsect. 7.3). Second, we want to substantiate the intuition of
Subsect. 3.4 that property generation by PQE (in our case, invariant generation
by PQE) has the same reasons to be effective as testing. In particular, by taking
out different clauses one generates invariants relating to different parts of the
design. So, taking out a clause of the buggy part is likely to produce an unwanted
invariant. Third, we want to give an example of an invariant that can be easily
identified as unwanted².

      if (write == 1 && currSize < n)
    *   if (dataIn != Val)
          begin
            Data[wrPnt] = dataIn;
            wrPnt = wrPnt + 1;
          end

    Fig. 3. A buggy fragment of Verilog code describing Fifo


7.1 Buffer Description 


Consider a FIFO buffer that we will refer to as Fifo. Let $n$ be the number of
elements of Fifo and Data denote the data buffer of Fifo. Let each Data[i], $i =
1, \dots, n$ have $p$ bits and be an integer where $0 \le Data[i] < 2^p$. A fragment of the
Verilog code describing Fifo is shown in Fig. 3. This fragment has a buggy line
marked with an asterisk. In the correct version without the marked line, a new
element dataIn is added to Data if the write flag is on and Fifo has fewer than
$n$ elements. Since Data can have any combination of numbers, all Data states
are supposed to be reachable. However, due to the bug, the number Val cannot
appear in Data. (Here Val is some constant $0 < Val < 2^p$. We assume that the
buffer elements are initialized to 0.) So, Fifo has an op-state reachability bug
since it cannot reach operative states where an element of Data equals Val.
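The effect of the buggy line can be seen on a tiny explicit-state model. The sketch below is a hypothetical Python abstraction of Fifo, not a translation of the actual Verilog design: only the write path comes from Fig. 3, while the read behavior, the wrap-around of wrPnt and the update of currSize are assumptions made here to obtain a closed model. It enumerates all reachable contents of Data for a small n and p.

```python
from itertools import product


def reachable_data(n, p, val, buggy):
    """Return the set of Data contents reachable from the all-zero buffer."""
    init = ((0,) * n, 0, 0)              # (Data, wrPnt, currSize)
    seen = {init}
    frontier = [init]
    while frontier:
        data, wr_pnt, size = frontier.pop()
        for write, read, data_in in product((0, 1), (0, 1), range(2 ** p)):
            d, w, s = list(data), wr_pnt, size
            # Write path of Fig. 3; the extra test on Val is the buggy line.
            if write == 1 and s < n and (not buggy or data_in != val):
                d[w] = data_in
                w = (w + 1) % n          # assumed wrap-around of the pointer
                s += 1                   # assumed update of currSize
            elif read == 1 and s > 0:    # assumed read: only the size shrinks
                s -= 1
            nxt = (tuple(d), w, s)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return {data for data, _, _ in seen}


VAL = 3
correct = reachable_data(n=2, p=2, val=VAL, buggy=False)
broken = reachable_data(n=2, p=2, val=VAL, buggy=True)
print(len(correct), len(broken))
print(any(VAL in data for data in broken))
```

With two 2-bit elements the correct model reaches all 16 contents of Data, whereas the buggy one reaches only the 9 contents that avoid Val. The missing states are exactly what an unwanted invariant over the data variables describes.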


² Let $P(\hat{S})$ be an invariant for a circuit $N$ depending only on a subset $\hat{S}$ of the state
variables $S$. Identifying $P$ as an unwanted invariant is much easier if $\hat{S}$ is meaningful
from the high-level view of the design. Suppose, for instance, that assignments to $\hat{S}$
specify values of a high-level variable $v$. Then $P$ is unwanted if it claims unreachability
of a value of $v$ that is supposed to be reachable. Another simple example is that
assignments to $\hat{S}$ specify values of high-level variables $v$ and $w$ that are supposed to
be independent. Then $P$ is unwanted if it claims that some combinations of values of
$v$ and $w$ are unreachable. (This may mean, for instance, that an assignment operator
setting the value of $v$ erroneously involves the variable $w$.)




7.2 Bug Detection by Invariant Generation 


Let $N$ be a circuit implementing Fifo. Let $S$ be the set of state variables of $N$
and $S_{data} \subseteq S$ be the subset corresponding to the data buffer Data. We used
DS-PQE, EG-PQE and EG-PQE+ to generate invariants of $N$ as described
in Sect. 4. Note that an invariant $Q$ depending only on $S_{data}$ is an unwanted
one. If $Q$ holds for $N$, some states of Data are unreachable. Then Fifo has an
op-state reachability bug since every state of Data is supposed to be reachable.
To generate invariants, we used the formula $F_k = I(S_0) \wedge T(S_0, V_0, S_1) \wedge \dots \wedge
T(S_{k-1}, V_{k-1}, S_k)$ introduced in Subsect. 4.2. Here $I$ and $T$ describe the initial
state and the transition relation of $N$ respectively and $S_j$ and $V_j$ denote state
variables and combinational input variables of $j$-th time frame respectively. First,
we used a PQE solver to generate a local invariant $H(S_k)$ obtained by taking a
clause $C$ out of $\exists X_k[F_k]$ where $X_k = S_0 \cup V_0 \cup \dots \cup S_{k-1} \cup V_{k-1}$. So, $\exists X_k[F_k] \equiv
H \wedge \exists X_k[F_k \setminus \{C\}]$. (Since $F_k \Rightarrow H$, no state falsifying $H$ can be reached in $k$
transitions.) In the experiment, we took out only clauses of $F_k$ containing an
unquantified variable, i.e., a state variable of the $k$-th time frame. The time limit
for solving the PQE problem of taking out a clause was set to 10 s.


Table 1. FIFO buffer with n elements of 32 bits. Time limit is 10 s per PQE problem

buff.  | latches | time   | total pqe probs           | finished pqe probs        | unwanted invariant        | runtime (s)
size n |         | frames | ds-pqe | eg-pqe | eg-pqe+ | ds-pqe | eg-pqe | eg-pqe+ | ds-pqe | eg-pqe | eg-pqe+ | ds-pqe | eg-pqe | eg-pqe+
-------+---------+--------+--------+--------+---------+--------+--------+---------+--------+--------+---------+--------+--------+--------
8      | 300     | 5      | 1,236  | 311    | 8       | 2%     | 36%    | 35%     | no     | yes    | yes     | 12,141 | 2,138  | 52
8      | 300     | 10     | 560    | 737    | 39      | 2%     | 1%     | 3%      | yes    | yes    | yes     | 5,551  | 7,681  | 380
16     | 560     | 5      | 2,288  | 2,288  | 16      | 1%     | 65%    | 71%     | no     | no     | yes     | 22,612 | 9,506  | 50
16     | 560     | 10     | 653    | 2,288  | 24      | 1%     | 36%    | 38%     | yes    | no     | yes     | 6,541  | 16,554 | 153


For each clause $Q$ of every local invariant $H$ generated by PQE, we checked
if $Q$ was a global invariant. Namely, we used a public version of IC3 [17,18] to
verify if the property $Q$ held (by showing that no reachable state of $N$ falsified $Q$).
If so, and $Q$ depended only on variables of $S_{data}$, $N$ had an unwanted invariant.
Then we stopped invariant generation. The results of the experiment for buffers
with 32-bit elements are given in Table 1. When picking a clause to take out,
i.e., a clause with a state variable of the $k$-th time frame, one could make a good
choice by pure luck. To address this issue, we picked clauses to take out randomly,
performed 10 different runs of invariant generation and computed the
average value. So, columns four to twelve of Table 1 give the average
value of 10 runs.
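The difference between a local invariant (one that holds after exactly k transitions) and a global one can be illustrated on a toy design. The sketch below uses a hypothetical 3-bit counter and plain explicit-state enumeration. It stands in for neither PQE nor IC3 and only shows what the two checks decide.

```python
def step(state, enable):
    """Toy transition relation: a 3-bit counter that increments when enabled."""
    return (state + enable) % 8


def states_after(k):
    """States reachable from the initial state 0 in exactly k transitions."""
    current = {0}
    for _ in range(k):
        current = {step(s, e) for s in current for e in (0, 1)}
    return current


def all_reachable():
    """All states reachable from the initial state 0 in any number of steps."""
    seen, frontier = {0}, [0]
    while frontier:
        s = frontier.pop()
        for e in (0, 1):
            t = step(s, e)
            if t not in seen:
                seen.add(t)
                frontier.append(t)
    return seen


def clause_q(state):
    """Candidate clause Q: the most significant state bit is 0."""
    return (state >> 2) & 1 == 0


is_local = all(clause_q(s) for s in states_after(2))
is_global = all(clause_q(s) for s in all_reachable())
print(is_local, is_global)
```

Here Q holds in every state reachable in two transitions but fails in state 4, so it is a local invariant for k = 2 and not a global one. In the experiment only clauses that pass the global check are candidates for being unwanted invariants.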

Let us use the first line of Table 1 to explain its structure. The first two
columns show the number of elements in Fifo implemented by $N$ and the number
of latches in $N$ (8 and 300). The third column gives the number $k$ of time frames
(i.e., 5). The next three columns show the total number of PQE problems solved
by a PQE solver before an unwanted invariant was generated. For instance,
EG-PQE+ found such an invariant after solving 8 problems. On the other hand,
DS-PQE failed to find an unwanted invariant and had to solve all 1,236 PQE
problems of taking out a clause of $F_k$ with an unquantified variable. The following
three columns show the share of PQE problems finished in the time limit of 10 s.
For instance, EG-PQE finished 36% of 311 problems. The next three columns
show if an unwanted invariant was generated by a PQE solver. (EG-PQE and
EG-PQE+ found one whereas DS-PQE did not.) The last three columns give
the total run time. Table 1 shows that only EG-PQE+ managed to generate an
unwanted invariant for all four instances of Fifo. This invariant asserted that
Fifo cannot reach a state where an element of Data equals Val.


7.3 Detection of the Bug by Conventional Methods 


The bug above (or its modified version) can be overlooked by conventional meth- 
ods. Consider, for instance, testing. It is hard to detect this bug by random tests 
because it is exposed only if one tries to add Val to Fifo. The same applies to 
testing using the line coverage metric [19]. On the other hand, a test set with 
100% branch coverage [19] will find this bug. (To invoke the else branch of the 
if statement marked with ‘*’ in Fig. 3, one must set dataIn to Val.) However, a
slightly modified bug can be missed even by tests with 100% branch coverage [5]. 

Now consider manual generation of unwanted properties. It is virtually
impossible to guess an unwanted invariant of Fifo exposing this bug unless one
knows exactly what this bug is. However, one can detect this bug by checking
a property asserting that the element dataIn must appear in the buffer if Fifo
is ready to accept it. Note that this is a non-invariant property involving states
of different time frames. The more time frames are used in such a property the
more guesswork is required to pick it. Let us consider a modified bug. Suppose
Fifo does not reject the element Val. So, the non-invariant property above holds.
However, if dataIn == Val, then Fifo changes the previously accepted element if
that element was Val too. So, Fifo cannot have two consecutive elements Val.
Our method will detect this bug via generating an unwanted invariant falsified by
states with consecutive elements Val. One can also identify this bug by checking
a property involving two consecutive elements of Fifo. But picking it requires a
lot of guesswork and so the modified bug can be easily overlooked.


8 Experiments with HWMCC Benchmarks 


In this section, we describe three experiments with 98 multi-property benchmarks
of the HWMCC-13 set [20]. (We use this set because it has a multi-property
track, see the explanation below.) The number of latches in those
benchmarks ranges from 111 to 8,000. More details about the choice of benchmarks
and the experiments can be found in [5]. Each benchmark consists of a
sequential circuit $N$ and invariants $P_0, \dots, P_m$ to prove. Like in Sect. 4, we call
$P_{agg} = P_0 \wedge \dots \wedge P_m$ the aggregate invariant. In experiments 2 and 3 we used
PQE to generate new invariants of $N$. Since every invariant $P$ implied by $P_{agg}$
is a desired one, the necessary condition for $P$ to be unwanted is $P_{agg} \not\Rightarrow P$. The
conjunction of many invariants $P_i$ produces a stronger invariant $P_{agg}$, which
makes it harder to generate $P$ not implied by $P_{agg}$. (This is the reason for using
multi-property benchmarks in our experiments.) The circuits of the HWMCC-13
set are anonymous, so we could not know if an unreachable state is supposed to
be reachable. For that reason, we just generated invariants not implied by $P_{agg}$
without deciding if some of them were unwanted.

Similarly to the experiment of Sect. 7, we used the formula $F_k = I(S_0) \wedge
T(S_0, V_0, S_1) \wedge \dots \wedge T(S_{k-1}, V_{k-1}, S_k)$ to generate invariants. The number $k$ of
time frames was in the range of $2 \le k \le 10$. As in the experiment of Sect. 7, we
took out only clauses containing a state variable of the $k$-th time frame. In all
experiments, the time limit for solving a PQE problem was set to 10 s.


8.1 Experiment 1 


In the first experiment, we generated a local invariant $H$ by taking out a clause
$C$ of $\exists X_k[F_k]$ where $X_k = S_0 \cup V_0 \cup \dots \cup S_{k-1} \cup V_{k-1}$. The formula $H$ asserts
that no state falsifying $H$ can be reached in $k$ transitions. Our goal was to show
that PQE can find $H$ for large formulas $F_k$ that have hundreds of thousands
of clauses. We used EG-PQE to partition the PQE problems we tried into two
groups. The first group consisted of 3,736 problems for which we ran EG-PQE
with the time limit of 10 s and it never encountered a subspace $\vec{s}_k$ where $F_k$ was
satisfiable. Here $\vec{s}_k$ is a full assignment to $S_k$. Recall that only the variables $S_k$
are unquantified in $\exists X_k[F_k]$. So, in every subspace $\vec{s}_k$, formula $F_k$ was either
unsatisfiable or $(F_k \setminus \{C\}) \Rightarrow C$. (The fact that so many problems meet the
condition of the first group came as a big surprise.) The second group consisted of
3,094 problems where EG-PQE encountered subspaces where $F_k$ was satisfiable.

For the first group, DS-PQE finished only 30% of the problems within 10 s
whereas EG-PQE and EG-PQE+ finished 88% and 89% respectively. The poor
performance of DS-PQE is due to not checking if $(F_k \setminus \{C\}) \Rightarrow C$ in the current
subspace. For the second group, DS-PQE, EG-PQE and EG-PQE+ finished
15%, 2% and 27% of the problems respectively within 10 s. EG-PQE finished far
fewer problems because it used a satisfying assignment as a proof of redundancy
of $C$ (see Subsect. 6.2).

To contrast PQE and QE, we employed a high-quality tool CADET [21,22]
to perform QE on the 98 formulas $\exists X_k[F_k]$ (one formula per benchmark). That
is, instead of taking a clause out of $\exists X_k[F_k]$ by PQE, we applied CADET to
perform full QE on this formula. (Performing QE on $\exists X_k[F_k]$ produces a formula
$H(S_k)$ specifying all states unreachable in $k$ transitions.) CADET finished only
25% of the 98 QE problems with the time limit of 600 s. On the other hand,
EG-PQE+ finished 60% of the 6,830 problems of both groups (generated off
$\exists X_k[F_k]$) within 10 s. So, PQE can be much easier than QE if only a small part
of the formula gets unquantified.


8.2 Experiment 2


The second experiment was an extension of the first one. Its goal was to show
that PQE can generate invariants for realistic designs. For each clause $Q$ of a
local invariant $H$ generated by PQE we used IC3 to verify if $Q$ was a global
invariant. If so, we checked if $P_{agg} \not\Rightarrow Q$ held. To make the experiment less time
consuming, in addition to the time limit of 10 s per PQE problem we imposed
a few more constraints. The PQE problem of taking a clause out of $\exists X_k[F_k]$
terminated as soon as $H$ accumulated 5 clauses or more. Besides, processing
a benchmark aborted when the total number of clauses of all formulas $H$
generated for this benchmark reached 100 or the total run time of all PQE
problems generated off $\exists X_k[F_k]$ exceeded 2,000 s.
Table 2 shows the results of the experiment. The third column gives the number
of local single-clause invariants (i.e., the total number of clauses in all $H$ over
all benchmarks). The fourth column shows how many local single-clause invariants
turned out to be global. (Since global invariants were extracted from $H$ and the
total size of all $H$ could not exceed 100, the number of global invariants per
benchmark could not exceed 100.) The last column gives the number of global
invariants not implied by $P_{agg}$. So, these invariants are candidates for checking
if they are unwanted. Table 2 shows that EG-PQE and EG-PQE+ performed much
better than DS-PQE.

Table 2. Invariant generation

pqe     | #bench- | results
solver  | marks   | local single-clause invar. | global single-clause invar. | not implied by Pagg
--------+---------+----------------------------+-----------------------------+--------------------
ds-pqe  | 98      | 5,556                      | 2,678                       | 2,309
eg-pqe  | 98      | 9,498                      | 4,839                       | 4,009
eg-pqe+ | 98      | 9,303                      | 4,773                       | 3,940


8.3 Experiment 3 


To prove an invariant $P$ true, IC3 conjoins it with clauses $Q_1, \dots, Q_n$ to make
$P \wedge Q_1 \wedge \dots \wedge Q_n$ inductive. If IC3 succeeds, every $Q_i$ is an invariant. Moreover,
$Q_i$ may be an unwanted invariant. The goal of the third experiment was to
demonstrate that PQE and IC3, in general, produce different invariant clauses.
The intuition here is twofold. First, IC3 generates clauses $Q_i$ to prove a predefined
invariant rather than find an unwanted one. Second, the closer $P$ is to being
inductive, the fewer new invariant clauses are generated by IC3. Consider the
circuit $N_{triv}$ that simply stays in the initial state $\vec{s}_{ini}$ (Sect. 4). Any invariant
satisfied by $\vec{s}_{ini}$ is already inductive for $N_{triv}$. So, IC3 will not generate a single
new invariant clause. On the other hand, if the correct circuit is supposed to
leave the initial state, $N_{triv}$ has unwanted invariants that our method will find.

In this experiment, we used IC3 to generate $P^*_{agg}$, an inductive version of
$P_{agg}$. The experiment showed that in 88% of cases, an invariant clause generated
by EG-PQE+ and not implied by $P_{agg}$ was not implied by $P^*_{agg}$ either. (More
details about this experiment can be found in [5].)


9 Properties Mimicking Symbolic Simulation


Let M(X,V,W) be a combinational circuit where X,V,W are internal, input 
and output variables. In this section, we describe generation of properties of M 
that mimic symbolic simulation [23]. Every such property Q(V) specifies a
cube of tests that produce the same values for a given subset of variables of W. 
We chose generation of such properties because deciding if Q is an unwanted 
property is, in general, simple. The procedure for generation of these properties 
is slightly different from the one presented in Sect. 3. 

Let $F(X,V,W)$ be a formula specifying $M$. Let $B(W)$ be a clause. Let $H(V)$
be a solution to the PQE problem of taking a clause $C \in F$ out of $\exists X \exists W[F \wedge B]$.
That is, $\exists X \exists W[F \wedge B] \equiv H \wedge \exists X \exists W[(F \setminus \{C\}) \wedge B]$. Let $Q(V)$ be a clause of $H$.
Then $M$ has the property that for every full assignment $\vec{v}$ to $V$ falsifying $Q$,
it produces an output $\vec{w}$ falsifying $B$ (a proof of this fact can be found in [5]).
Suppose, for instance, $Q = v_1 \vee \bar{v}_{10} \vee v_{39}$ and $B = w_2 \vee \bar{w}_{40}$. Then for every $\vec{v}$ where
$v_1 = 0, v_{10} = 1, v_{39} = 0$, the circuit $M$ produces an output where $w_2 = 0, w_{40} = 1$.
Note that $Q$ is implied by $F \wedge B$ rather than $F$. So, it is a property of $M$ under
constraint $B$ rather than $M$ alone. The property $Q$ is unwanted if there is an
input falsifying $Q$ that should not produce an output falsifying $B$.
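To make the meaning of such a property concrete, the sketch below checks it by enumeration on a hypothetical three-input circuit. The circuit, the clauses Q and B, and the dictionary encoding of clauses are all illustrative assumptions. The code does not perform PQE and only tests whether every input falsifying Q yields an output falsifying B.

```python
from itertools import product


def circuit(v1, v2, v3):
    """Toy combinational circuit with outputs w1 = v1 AND v2, w2 = v2 OR v3."""
    return (v1 and v2), (v2 or v3)


def falsifies(clause, values):
    """A clause {name: polarity} is falsified when every literal is false."""
    return all(values[name] != polarity for name, polarity in clause.items())


def is_property(Q, B):
    """True if every input falsifying Q produces an output falsifying B."""
    for v1, v2, v3 in product((False, True), repeat=3):
        w1, w2 = circuit(v1, v2, v3)
        inputs = {"v1": v1, "v2": v2, "v3": v3}
        outputs = {"w1": w1, "w2": w2}
        if falsifies(Q, inputs) and not falsifies(B, outputs):
            return False
    return True


B = {"w1": True, "w2": False}         # B = w1 v ~w2, falsified by w1=0, w2=1
Q = {"v1": True, "v3": False}         # Q = v1 v ~v3, falsified by v1=0, v3=1
print(is_property(Q, B))              # holds for both values of v2
print(is_property({"v1": True}, B))   # too short a clause: w2 may be 0
```

The clause Q leaves v2 unassigned, so it describes a cube of two tests with the same values of w1 and w2. This is the sense in which such properties mimic symbolic simulation.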

To generate combinational circuits, we unfolded sequential circuits of the set
of 98 benchmarks used in Sect. 8 for invariant generation. Let $N$ be a sequential
circuit. (We reuse the notation of Sect. 4.) Let $M_k(S_0, V_0, \dots, S_{k-1}, V_{k-1}, S_k)$
denote the combinational circuit obtained by unfolding $N$ for $k$ time frames.
Here $S_j$, $V_j$ are state and input variables of $j$-th time frame respectively. Let $F_k$
denote the formula $I(S_0) \wedge T(S_0, V_0, S_1) \wedge \dots \wedge T(S_{k-1}, V_{k-1}, S_k)$ describing the
unfolding of $N$ for $k$ time frames. Note that $F_k$ specifies the circuit $M_k$ above
under the input constraint $I(S_0)$. Let $B(S_k)$ be a clause. Let $H(S_0, V_0, \dots, V_{k-1})$
be a solution to the PQE problem of taking a clause $C \in F_k$ out of formula
$\exists S_{1,k}[F_k \wedge B]$. Here $S_{1,k} = S_1 \cup \dots \cup S_k$. That is, $\exists S_{1,k}[F_k \wedge B] \equiv H \wedge \exists S_{1,k}[(F_k \setminus
\{C\}) \wedge B]$. Let $Q$ be a clause of $H$. Then for every assignment $(\vec{s}_{ini}, \vec{v}_0, \dots, \vec{v}_{k-1})$
falsifying $Q$, the circuit $M_k$ outputs $\vec{s}_k$ falsifying $B$. (Here $\vec{s}_{ini}$ is the initial state
of $N$ and $\vec{s}_k$ is a state of the last time frame.)


In the experiment, we used DS-PQE, EG-PQE and EG-PQE+ to solve 1,586 PQE
problems described above. In Table 3, we give a sample of results by EG-PQE+.
(More details about this experiment can be found in [5].) Below, we use the
first line of Table 3 to explain its structure.

Table 3. Property generation for combinational circuits

name  | latches | time   | size | subcircuit M'_k     | results
      |         | frames | of B | gates   | inp. vars | min | max | time (s) | 3-val sim.
------+---------+--------+------+---------+-----------+-----+-----+----------+-----------
6s326 | 3,342   | 20     | 15   | 348,479 | 1,774     | 27  | 28  | 2.9      | no
6s40m | 5,608   | 20     | 15   | 406,474 | 3,450     | 27  | 29  | 1.1      | no
6s250 | 6,185   | 20     | 15   | 556,562 | 2,456     | 50  | 54  | 0.8      | no
6s395 | 463     | 30     | 15   | 36,088  | 569       | 24  | 26  | 0.7      | yes
6s339 | 1,594   | 30     | 15   | 179,543 | 3,978     | 70  | 71  | 3.1      | no
6s292 | 3,190   | 30     | 15   | 154,014 | 978       | 86  | 89  | 1.1      | no
6s143 | 260     | 40     | 15   | 551,019 | 16,689    | 526 | 530 | 2.5      | yes
6s372 | 1,124   | 40     | 15   | 295,626 | 2,766     | 513 | 518 | 1.7      | no
6s335 | 1,658   | 40     | 15   | 207,787 | 2,863     | 120 | 124 | 6.7      | no
6s391 | 2,686   | 40     | 15   | 240,825 | 7,579     | 340 | 341 | 8.9      | no

The first column gives the benchmark name (6s326). The next column shows that
6s326 has 3,342 latches. The third column gives
the number of time frames used to produce a combinational circuit $M_k$ (here
$k = 20$). The next column shows that the clause $B$ introduced above consisted of
15 literals of variables from $S_k$. (Here and below we still use the index $k$ assuming
that $k = 20$.) The literals of $B$ were generated randomly. When picking the
length of $B$ we just tried to simulate the situation where one wants to set a
particular subset of output variables of $M_k$ to specified values. The next two
columns give the size of the subcircuit $M'_k$ of $M_k$ that feeds the output variables
present in $B$. When computing a property $H$ we took a clause out of formula
$\exists S_{1,k}[F'_k \wedge B]$ where $F'_k$ specifies $M'_k$ instead of formula $\exists S_{1,k}[F_k \wedge B]$ where $F_k$
specifies $M_k$. (The logic of $M_k$ not feeding a variable of $B$ is irrelevant for
computing $H$.) The first column of the pair gives the number of gates in $M'_k$ (i.e.,
348,479). The second column provides the number of input variables feeding $M'_k$
(i.e., 1,774). Here we count only variables of $V_0 \cup \dots \cup V_{k-1}$ and ignore those of $S_0$
since the latter are already assigned values specifying the initial state $\vec{s}_{ini}$ of $N$.

The next four columns show the results of taking a clause out of $\exists S_{1,k}[F'_k \wedge B]$.
For each PQE problem the time limit was set to 10 s. Besides, EG-PQE+
terminated as soon as 5 clauses of property $H(S_0, V_0, \dots, V_{k-1})$ were generated.
The first three columns out of four describe the minimum and maximum sizes
of clauses in $H$ and the run time of EG-PQE+. So, it took EG-PQE+ 2.9 s
to produce a formula $H$ containing clauses of sizes from 27 to 28 variables. A
clause $Q$ of $H$ with 27 variables, for instance, specifies $2^{1747}$ tests falsifying $Q$ that
produce the same output of $M'_k$ (falsifying the clause $B$). Here $1747 = 1774 - 27$
is the number of input variables of $M'_k$ not present in $Q$. The last column shows
that at least one clause $Q$ of $H$ specifies a property that cannot be produced by
3-valued simulation (a version of symbolic simulation [23]). To prove this, one
just needs to set the input variables of $M'_k$ present in $Q$ to the values falsifying $Q$
and run 3-valued simulation. (The remaining input variables of $M'_k$ are assigned
a don’t-care value.) If after 3-valued simulation some output variable of $M'_k$ is
assigned a don’t-care value, the property specified by $Q$ cannot be produced by
3-valued simulation.
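The check described above is easy to reproduce on a small example. The sketch below implements 3-valued simulation for a hypothetical two-input circuit whose output always equals v1, with None playing the role of the don't-care value. The gate functions and the circuit are assumptions chosen only to show how a property can hold and still be invisible to 3-valued simulation.

```python
DONT_CARE = None   # the third value used by 3-valued simulation


def and3(a, b):
    """3-valued AND: a controlling 0 decides the output."""
    if a is False or b is False:
        return False
    if a is DONT_CARE or b is DONT_CARE:
        return DONT_CARE
    return True


def or3(a, b):
    """3-valued OR: a controlling 1 decides the output."""
    if a is True or b is True:
        return True
    if a is DONT_CARE or b is DONT_CARE:
        return DONT_CARE
    return False


def not3(a):
    """3-valued NOT: a don't-care stays a don't-care."""
    return DONT_CARE if a is DONT_CARE else (not a)


def circuit3(v1, v2):
    """w = (v1 AND v2) OR (v1 AND NOT v2), which equals v1 for binary inputs."""
    return or3(and3(v1, v2), and3(v1, not3(v2)))


# Property: every input with v1 = 1 produces w = 1 (the clause ~v1 is falsified).
binary_outputs = [circuit3(True, v2) for v2 in (False, True)]
simulated = circuit3(True, DONT_CARE)
print(binary_outputs, simulated)
```

Both binary completions give w = 1, yet the simulation with v2 set to don't-care returns a don't-care output. So this property cannot be produced by 3-valued simulation, which is the situation the last column of Table 3 reports.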

Running DS-PQE, EG-PQE and EG-PQE+ on the 1,586 PQE problems
mentioned above showed that a) EG-PQE performed poorly, producing properties
for only 28% of problems; b) DS-PQE and EG-PQE+ showed much better
results by generating properties for 62% and 66% of problems respectively. When
DS-PQE and EG-PQE+ succeeded in producing properties, these properties could
not be obtained by 3-valued simulation in 74% and 78% of cases respectively.


10 Some Background 


In this section, we discuss some research relevant to PQE and property genera- 
tion. Information on BDD based QE can be found in [24,25]. SAT based QE is 
described in [12,21,26-32]. Our first PQE solver called DS-PQE was introduced 
in [1]. It was based on redundancy based reasoning presented in [33] in terms of 
variables and in [34] in terms of clauses. The main flaw of DS-PQE is as follows.
Consider taking a clause $C$ out of $\exists X[F]$. Suppose DS-PQE proved $C$ redundant
in a subspace where $F$ is satisfiable and some quantified variables are assigned.
The problem is that DS-PQE cannot simply assume that $C$ is redundant every
time it re-enters this subspace [35]. The root of the problem is that redundancy
is a structural rather than semantic property. That is, redundancy of a clause in
a formula $\xi$ (quantified or not) does not imply such redundancy in every formula
logically equivalent to $\xi$. Since our current implementation of EG-PQE+ uses
DS-PQE as a subroutine, it has the same learning problem. We showed in [36]
that this problem can be addressed by the machinery of certificate clauses. So,
the performance of PQE can be drastically improved via enhanced learning in
subspaces where $F$ is satisfiable.

We are unaware of research on property generation for combinational cir- 
cuits. As for invariants, the existing procedures typically generate some auxiliary 
desired invariants to prove a predefined property (whereas our goal is to generate 
invariants that are unwanted). For instance, they generate loop invariants [37] 
or invariants relating internal points of circuits checked for equivalence [38]. 
Another example of auxiliary invariants are clauses generated by IC3 to make 
an invariant inductive [17]. As we showed in Subsect. 8.3, the invariants produced 
by PQE are, in general, different from those built by IC3. 


11 Conclusions and Directions for Future Research 


We consider Partial Quantifier Elimination (PQE) on propositional CNF formulas 
with existential quantifiers. In contrast to complete quantifier elimination, 
PQE allows one to unquantify only a part of the formula. We show that PQE can be 
used to generate properties of combinational and sequential circuits. The goal of 
property generation is to check if a design has an unwanted property and thus 
is buggy. We used PQE to generate an unwanted invariant for a FIFO buffer 
exposing a non-trivial bug. We also applied PQE to invariant generation for 
HWMCC benchmarks. Finally, we used PQE to generate properties of combina- 
tional circuits mimicking symbolic simulation. Our experiments show that PQE 
can efficiently generate properties for realistic designs. 

There are at least three directions for future research. The first direction 
is to improve the performance of PQE solving. As we mentioned in Sect. 10, 
the most promising idea here is to enhance the power of learning in subspaces 
where the formula is satisfiable. The second direction is to use the improved 
PQE solvers to design new, more efficient algorithms for well-known problems 
like SAT, model checking and equivalence checking. The third direction is to 
look for new problems that can be solved by PQE. 

Abstract. The problem of model counting, also known as #SAT, is 
to compute the number of models or satisfying assignments of a given 
Boolean formula F. Model counting is a fundamental problem in computer 
science with a wide range of applications. In recent years, there has 
been a growing interest in using hashing-based techniques for approximate 
model counting that provide (ε, δ)-guarantees: i.e., the count 
returned is within a (1 + ε)-factor of the exact count with confidence 
at least 1 − δ. While hashing-based techniques attain reasonable scalability 
for large enough values of δ, their scalability is severely impacted 
for smaller values of δ, thereby preventing their adoption in application 
domains that require estimates with high confidence. 

The primary contribution of this paper is to address the Achilles 
heel of hashing-based techniques: we propose a novel approach 
based on rounding that allows us to achieve a significant reduction 
in runtime for smaller values of δ. The resulting counter, called 
ApproxMC6 (available open-source at 
https://github.com/meelgroup/approxmc), achieves a substantial runtime 
performance improvement over the current state-of-the-art counter, 
ApproxMC. In particular, our extensive evaluation over a benchmark suite 
consisting of 1890 instances shows that ApproxMC6 solves 204 more instances 
than ApproxMC and achieves a 4× speedup over ApproxMC. 


1 Introduction 


Given a Boolean formula F, the problem of model counting is to compute the 
number of models of F. Model counting is a fundamental problem in computer 
science with a wide range of applications, such as control improvisation [13], 
network reliability [9,28], neural network verification [2], and probabilistic 
reasoning [5,11,20,21]. Beyond its many applications, model counting is also a 
central problem in theoretical computer science. In his 
seminal paper, Valiant showed that #SAT is #P-complete, where #P is the set 
of counting problems whose decision versions lie in NP [28]. Subsequently, Toda 
demonstrated the theoretical hardness of the problem by showing that every 
problem in the entire polynomial hierarchy can be solved by just one call to a 
#P oracle; more formally, $\mathsf{PH} \subseteq \mathsf{P}^{\#\mathsf{P}}$ [27]. 

Given the computational intractability of #SAT, there has been sustained 
interest in the development of approximate techniques from theoreticians and 
practitioners alike. Stockmeyer introduced a randomized hashing-based technique 
that provides (ε, δ)-guarantees (formally defined in Sect. 2) given access 
to an NP oracle [25]. Given the lack of practical solvers that could handle 
problems in NP satisfactorily, there were no practical implementations of 
Stockmeyer's hashing-based techniques until the 2000s [14]. Building on the 
unprecedented advancements in the development of SAT solvers, Chakraborty, Meel, 
and Vardi extended Stockmeyer's framework to a scalable (ε, δ)-counting 
algorithm, ApproxMC [7]. The subsequent years have witnessed a sustained 
interest in further optimizations of the hashing-based techniques for approximate 
counting [5,6,10,11,17-19,23,29,30]. The current state-of-the-art technique for 
approximate counting is a hashing-based framework called ApproxMC, which is 
in its fourth version, called ApproxMC4 [22,24]. 

The core theoretical idea behind the hashing-based framework is to use 2- 
universal hash functions to partition the solution space, denoted by sol(F) for a 
formula F, into roughly equal small cells, wherein a cell is considered small if 
it contains solutions less than or equal to a pre-computed threshold, thresh. An 
NP oracle (in practice, a SAT solver) is employed to check if a cell is small by 
enumerating solutions one-by-one until either there are no more solutions or we 
have already enumerated thresh + 1 solutions. Then, we randomly pick a cell, 
enumerate solutions within the cell (if the cell is small), and scale the obtained 
count by the number of cells to obtain an estimate for |sol(F)|. To amplify the 
confidence, we rely on the standard median technique: repeat the above process, 
called ApproxMCCore, multiple times and return the median. Computing the 
median amplifies the confidence because, for the median of t repetitions to be outside 
the desired range (i.e., $\left[\frac{|\mathsf{sol}(F)|}{1+\varepsilon},\ (1+\varepsilon)|\mathsf{sol}(F)|\right]$), it should be the case that at 
least half of the repetitions of ApproxMCCore returned a wrong estimate. 

In practice, every subsequent repetition of ApproxMCCore takes a similar 
time, and the overall runtime increases linearly with the number of invocations. 
The number of repetitions depends logarithmically on $\delta^{-1}$. As a particular example, 
for ε = 0.8, the number of repetitions of ApproxMCCore to attain δ = 0.1 
is 21, which increases to 117 for δ = 0.001: a significant increase in the number 
of repetitions (and accordingly, the time taken). Accordingly, it is no surprise 
that empirical analysis of tools such as ApproxMC has been presented with a 
high δ (such as δ = 0.1). On the other hand, for several applications, such as 
network reliability and quantitative verification, the end users desire estimates 
with high confidence. Therefore, the design of efficient counting techniques for 
small δ is a major challenge that one needs to address to enable the adoption of 
approximate counting techniques in practice. 

The primary contribution of our work is to address the above challenge. 
We introduce a new technique called rounding that enables dramatic reductions 
in the number of repetitions required to attain a desired value of confidence. 
The core technical idea behind the design of the rounding technique is 
based on the following observation: Let L (resp. U) refer to the event that a 
given invocation of ApproxMCCore under (resp. over)-estimates |sol(F)|. For a 
median estimate to be wrong, either the event L happens in half of the invocations 
of ApproxMCCore or the event U happens in half of the invocations 
of ApproxMCCore. The number of repetitions depends on max{Pr[L], Pr[U]}. 
The current algorithmic design (and ensuing analysis) of ApproxMCCore provides 
a weak upper bound on max{Pr[L], Pr[U]}: in particular, the bounds on 
max{Pr[L], Pr[U]} and Pr[L ∪ U] are almost identical. Our key technical contribution 
is to design a new procedure, ApproxMC6Core, based on the rounding technique 
that allows us to obtain significantly better bounds on max{Pr[L], Pr[U]}. 

The resulting algorithm, called ApproxMC6, follows a similar structure 
to that of ApproxMC: it repeatedly invokes the underlying core procedure 
ApproxMC6Core and returns the median of the estimates. Since a single invocation 
of ApproxMC6Core takes as much time as ApproxMCCore, the reduction in 
the number of repetitions is primarily responsible for the ensuing speedup. As 
an example, for ε = 0.8, the number of repetitions of ApproxMC6Core to attain 
δ = 0.1 and δ = 0.001 is just 5 and 19, respectively; the corresponding numbers 
for ApproxMC were 21 and 117. An extensive experimental evaluation on 
1890 benchmarks shows that the rounding technique provides a 4× speedup over 
the state-of-the-art approximate model counter, ApproxMC. Furthermore, for a 
given timeout of 5000s, ApproxMC6 solves 204 more instances than ApproxMC 
and achieves a reduction of 1063s in the PAR-2 score. 

The rest of the paper is organized as follows. We introduce notation and 
preliminaries in Sect. 2. To place our contribution in context, we review related 
work in Sect. 3. We identify the weakness of the current technique in Sect. 4 and 
present the rounding technique in Sect. 5 to address this issue. Then, we present 
our experimental evaluation in Sect. 6. Finally, we conclude in Sect. 7. 


2 Notation and Preliminaries 


Let F be a Boolean formula in conjunctive normal form (CNF), and let Vars(F) 
be the set of variables appearing in F. The set Vars(F) is also called the support 
of F. An assignment σ of truth values to the variables in Vars(F) is called a 
satisfying assignment or witness of F if it makes F evaluate to true. We denote 
the set of all witnesses of F by sol(F). Throughout the paper, we will use n to 
denote |Vars(F)|. 

The propositional model counting problem is to compute |sol(F)| for a given 
CNF formula F. A probably approximately correct (or PAC) counter is a probabilistic 
algorithm ApproxCount(·,·,·) that takes as inputs a formula F, a tolerance 
parameter ε > 0, and a confidence parameter δ ∈ (0,1], and returns an (ε, δ)-estimate 
c, i.e., 

$$\Pr\left[\frac{|\mathsf{sol}(F)|}{1+\varepsilon} \le c \le (1+\varepsilon)|\mathsf{sol}(F)|\right] \ge 1-\delta.$$

PAC guarantees are also sometimes referred to as (ε, δ)-guarantees. 

A closely related notion is projected model counting, where we are interested 
in computing the cardinality of sol(F) projected on a subset of variables P ⊆ 
Vars(F). While for clarity of exposition, we describe our algorithm in the context 
of model counting, the techniques developed in this paper are applicable to 
projected model counting as well. Our empirical evaluation indeed considers 
such benchmarks. 


2.1 Universal Hash Functions 


Let n, m ∈ ℕ and $H(n, m) = \{h : \{0,1\}^n \to \{0,1\}^m\}$ be a family of hash functions 
mapping $\{0,1\}^n$ to $\{0,1\}^m$. We use $h \xleftarrow{R} H(n, m)$ to denote the probability 
space obtained by choosing a function h uniformly at random from H(n, m). To 
measure the quality of a hash function we are interested in the set of elements of 
sol(F) mapped to α by h, denoted $\mathsf{Cell}_{\langle F,h,\alpha\rangle}$, and its cardinality, i.e., $|\mathsf{Cell}_{\langle F,h,\alpha\rangle}|$. 
We write Pr[Z : Ω] to denote the probability of outcome Z when sampling from 
a probability space Ω. For brevity, we omit Ω when it is clear from the context. 
The expected value of Z is denoted E[Z] and its variance is denoted σ²[Z]. 


Definition 1. A family of hash functions H(n, m) is strongly 2-universal if 
$\forall x, y \in \{0,1\}^n$, $\alpha \in \{0,1\}^m$, $h \xleftarrow{R} H(n, m)$, 

$$\Pr[h(x) = \alpha] = \frac{1}{2^m} = \Pr[h(x) = h(y)]$$

For $h \xleftarrow{R} H(n, n)$ and $\forall m \in \{1, \ldots, n\}$, the $m^{th}$ prefix-slice of h, denoted $h^{(m)}$, is 
a map from $\{0,1\}^n$ to $\{0,1\}^m$, such that $h^{(m)}(y)[i] = h(y)[i]$, for all $y \in \{0,1\}^n$ 
and for all $i \in \{1, \ldots, m\}$. Similarly, the $m^{th}$ prefix-slice of $\alpha \in \{0,1\}^n$, denoted 
$\alpha^{(m)}$, is an element of $\{0,1\}^m$ such that $\alpha^{(m)}[i] = \alpha[i]$ for all $i \in \{1, \ldots, m\}$. 
To avoid cumbersome terminology, we abuse notation and write $\mathsf{Cell}_{\langle F,m\rangle}$ (resp. 
$\mathsf{Cnt}_{\langle F,m\rangle}$) as a short-hand for $\mathsf{Cell}_{\langle F,h^{(m)},\alpha^{(m)}\rangle}$ (resp. $|\mathsf{Cell}_{\langle F,h^{(m)},\alpha^{(m)}\rangle}|$). The following 
proposition presents two results that are frequently used throughout this 
paper. The proof is deferred to Appendix A. 


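Proposition 1. For every 1 ≤ m ≤ n, the following holds: 

$$\mathsf{E}\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] = \frac{|\mathsf{sol}(F)|}{2^m} \qquad (1)$$

$$\sigma^2\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] \le \mathsf{E}\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] \qquad (2)$$


The usage of prefix-slice of h ensures monotonicity of the random variable 
$\mathsf{Cnt}_{\langle F,m\rangle}$, since from the definition of prefix-slice, we have that for every 1 ≤ 
m < n, $h^{(m+1)}(y) = \alpha^{(m+1)} \Rightarrow h^{(m)}(y) = \alpha^{(m)}$. Formally, 


Proposition 2. For every 1 ≤ m < n, $\mathsf{Cell}_{\langle F,m+1\rangle} \subseteq \mathsf{Cell}_{\langle F,m\rangle}$ 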
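
The prefix-slice construction and the nesting of cells stated in Proposition 2 can be illustrated with a small sketch. It assumes the commonly used XOR-based family h(y) = Ay ⊕ b over GF(2) as the hash family (the definitions above only require strong 2-universality), and it replaces sol(F) by an explicitly listed toy set; the names `random_hash` and `cell` are hypothetical.

```python
import random
from itertools import product


def random_hash(n, rng):
    """Draw h(y) = A*y XOR b over GF(2) with uniformly random A (n x n) and b."""
    A = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
    b = [rng.randint(0, 1) for _ in range(n)]

    def h(y):
        return tuple((sum(a * x for a, x in zip(row, y)) + bit) % 2
                     for row, bit in zip(A, b))

    return h


def cell(sols, h, alpha, m):
    """Solutions y with h^(m)(y) = alpha^(m): only the first m output bits are compared."""
    return {y for y in sols if h(y)[:m] == tuple(alpha[:m])}


rng = random.Random(0)
n = 6
# Toy stand-in for sol(F): assignments whose number of ones is divisible by 3.
sols = {y for y in product((0, 1), repeat=n) if sum(y) % 3 == 0}
h = random_hash(n, rng)
alpha = [rng.randint(0, 1) for _ in range(n)]

# Cell sizes for m = 0, 1, ..., n; m = 0 imposes no constraint at all.
sizes = [len(cell(sols, h, alpha, m)) for m in range(n + 1)]
```

Because each additional prefix bit only adds a constraint, `sizes` can never increase with m, whatever hash function is drawn.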


2.2 Helpful Combinatorial Inequality 
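Lemma 1. Let $\eta(t, m, p) = \sum_{k=m}^{t} \binom{t}{k} p^k (1-p)^{t-k}$ and p < 0.5, then 

$$\eta(t, \lceil t/2 \rceil, p) \in \Theta\left(t^{-\frac{1}{2}} \left(2\sqrt{p(1-p)}\right)^{t}\right)$$

Proof. We will derive both an upper and a matching lower bound for 
$\eta(t, \lceil t/2 \rceil, p)$, treating p as a constant. We begin by deriving an upper bound: 

$$\eta(t, \lceil t/2 \rceil, p) = \sum_{k=\lceil t/2 \rceil}^{t} \binom{t}{k} p^k (1-p)^{t-k} \le \binom{t}{\lceil t/2 \rceil} \sum_{k=\lceil t/2 \rceil}^{t} p^k (1-p)^{t-k} \le \binom{t}{\lceil t/2 \rceil} \cdot \frac{(p(1-p))^{\lceil t/2 \rceil}}{1-2p}$$

The first inequality holds because the central binomial coefficient is the largest one, 
and the second follows by summing the geometric series with ratio $\frac{p}{1-p} < 1$. 
By Stirling's approximation, $\binom{t}{\lceil t/2 \rceil} \in \Theta\left(t^{-\frac{1}{2}} 2^{t}\right)$; moreover, 
$(p(1-p))^{\lceil t/2 \rceil} \le (p(1-p))^{\frac{t}{2}}$. As a result, 
$\eta(t, \lceil t/2 \rceil, p) \in O\left(t^{-\frac{1}{2}} \left(2\sqrt{p(1-p)}\right)^{t}\right)$. Afterwards, we move on to deriving a 
matching lower bound by keeping only the first term of the sum: 

$$\eta(t, \lceil t/2 \rceil, p) = \sum_{k=\lceil t/2 \rceil}^{t} \binom{t}{k} p^k (1-p)^{t-k} \ge \binom{t}{\lceil t/2 \rceil} p^{\lceil t/2 \rceil} (1-p)^{t-\lceil t/2 \rceil} \ge \binom{t}{\lceil t/2 \rceil} (p(1-p))^{\frac{t}{2}} \cdot p^{\frac{1}{2}} (1-p)^{-\frac{1}{2}}$$

Applying Stirling's approximation to the binomial coefficient again, we obtain 
$\eta(t, \lceil t/2 \rceil, p) \in \Omega\left(t^{-\frac{1}{2}} \left(2\sqrt{p(1-p)}\right)^{t}\right)$. Combining these two bounds, 
we conclude that $\eta(t, \lceil t/2 \rceil, p) \in \Theta\left(t^{-\frac{1}{2}} \left(2\sqrt{p(1-p)}\right)^{t}\right)$. 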
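
As a numerical sanity check of Lemma 1 (not a proof), the following sketch evaluates the ratio between η(t, ⌈t/2⌉, p) and the reference term t^(-1/2) (2√(p(1−p)))^t for odd t. The value p = 0.169 is taken from the bounds used later in the paper; the function names `eta` and `lemma1_ratio` and the range of t are choices made here for illustration.

```python
from math import comb, sqrt


def eta(t, m, p):
    """Upper tail of Binomial(t, p): probability of at least m successes."""
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) for k in range(m, t + 1))


def lemma1_ratio(t, p):
    """eta(t, ceil(t/2), p) divided by t^(-1/2) * (2*sqrt(p*(1-p)))^t."""
    reference = t ** -0.5 * (2 * sqrt(p * (1 - p))) ** t
    return eta(t, (t + 1) // 2, p) / reference


# Ratios for odd t = 1, 3, ..., 199; a Theta-bound predicts they stay
# between two positive constants.
ratios = [lemma1_ratio(t, 0.169) for t in range(1, 200, 2)]
```

If the Θ-bound holds, the ratios should neither vanish nor blow up as t grows, which is what this range of t exhibits.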


3 Related Work 


The seminal work of Valiant established that #SAT is #P-complete [28]. Toda 
later showed that every problem in the polynomial hierarchy could be solved 
by just a polynomial number of calls to a #P oracle [27]. Based on Carter and 
Wegman's seminal work on universal hash functions [4], Stockmeyer proposed a 
probabilistic polynomial time procedure, with access to an NP oracle, to obtain 
an (ε, δ)-approximation of |sol(F)| [25]. 

Built on top of Stockmeyer's work, the core theoretical idea behind the 
hashing-based approximate solution counting framework, as presented in 
Algorithm 1 (ApproxMC [7]), is to use 2-universal hash functions to partition the 
solution space (denoted by sol(F) for a given formula F) into small cells of 
roughly equal size. A cell is considered small if the number of solutions it 
contains is less than or equal to a pre-determined threshold, thresh. An NP oracle is 
used to determine if a cell is small by iteratively enumerating its solutions until 
either there are no more solutions or thresh + 1 solutions have been found. In 
practice, a SAT solver is used to implement the NP oracle. To ensure a polynomial 
number of calls to the oracle, the threshold, thresh, is set to be polynomial 
in the input parameter ε at Line 1. The subroutine ApproxMCCore takes the 
formula F and thresh as inputs and estimates the number of solutions at Line 7. 
To determine the appropriate number of cells, i.e., the value of m for H(n, m), 
ApproxMCCore uses a search procedure at Line 3 of Algorithm 2. The estimate 
is calculated as the number of solutions in a randomly chosen cell, scaled by 
the number of cells, i.e., $2^m$, at Line 5. To improve confidence in the estimate, 
ApproxMC performs multiple runs of the ApproxMCCore subroutine at Lines 5-9 
of Algorithm 1. The final count is computed as the median of the estimates 
obtained at Line 10. 

Algorithm 1. ApproxMC(F, ε, δ) 
1: thresh ← $9.84\left(1+\frac{\varepsilon}{1+\varepsilon}\right)\left(1+\frac{1}{\varepsilon}\right)^{2}$; 
2: Y ← BoundedSAT(F, thresh); 
3: if (|Y| < thresh) then return |Y|; 
4: t ← $\lceil 17 \log_2(3/\delta) \rceil$; C ← emptyList; iter ← 0; 
5: repeat 
6:     iter ← iter + 1; 
7:     nSols ← ApproxMCCore(F, thresh); 
8:     AddToList(C, nSols); 
9: until (iter ≥ t); 
10: finalEstimate ← FindMedian(C); 
11: return finalEstimate; 


Algorithm 2. ApproxMCCore(F, thresh) 
1: Choose h at random from H(n, n); 
2: Choose α at random from $\{0,1\}^n$; 
3: m ← LogSATSearch(F, h, α, thresh); 
4: $\mathsf{Cnt}_{\langle F,m\rangle}$ ← BoundedSAT($F \wedge \left(h^{(m)}\right)^{-1}\left(\alpha^{(m)}\right)$, thresh); 
5: return ($2^m \times \mathsf{Cnt}_{\langle F,m\rangle}$); 
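
To make the interplay of Algorithms 1 and 2 concrete, here is a toy sketch in Python. It is an illustration under several assumptions and not the ApproxMC implementation: the solution set is listed explicitly instead of being queried through a SAT solver, the hash family is the XOR-based family h(y) = Ay ⊕ b, a plain linear search stands in for LogSATSearch, thresh is rounded up to an integer, and the number of repetitions t is passed in directly. All function names are hypothetical.

```python
import math
import random
from itertools import product
from statistics import median


def random_hash(n, rng):
    """Draw h(y) = A*y XOR b over GF(2) with uniformly random A and b."""
    A = [[rng.randint(0, 1) for _ in range(n)] for _ in range(n)]
    b = [rng.randint(0, 1) for _ in range(n)]

    def h(y):
        return tuple((sum(a * x for a, x in zip(row, y)) + bit) % 2
                     for row, bit in zip(A, b))

    return h


def bounded_count(images, alpha, m, thresh):
    """Stand-in for BoundedSAT: count solutions in the cell, stopping at thresh."""
    count = 0
    for image in images:
        if image[:m] == alpha[:m]:
            count += 1
            if count >= thresh:
                break
    return count


def approxmc_core(sols, n, thresh, rng):
    h = random_hash(n, rng)
    alpha = tuple(rng.randint(0, 1) for _ in range(n))
    images = [h(y) for y in sols]          # hash every solution once
    m = 1
    # Linear search in place of LogSATSearch: first m whose cell is small.
    while m < n and bounded_count(images, alpha, m, thresh) >= thresh:
        m += 1
    return (2 ** m) * bounded_count(images, alpha, m, thresh)


def approxmc(sols, n, eps, t, rng):
    thresh = math.ceil(9.84 * (1 + eps / (1 + eps)) * (1 + 1 / eps) ** 2)
    if len(sols) < thresh:                 # base case: the exact count is returned
        return len(sols)
    return median(approxmc_core(sols, n, thresh, rng) for _ in range(t))


rng = random.Random(1)
n = 10
sols = [y for y in product((0, 1), repeat=n) if sum(y) % 3 == 0]  # 341 solutions
estimate = approxmc(sols, n, eps=0.8, t=9, rng=rng)
small = approxmc(sols[:10], n, eps=0.8, t=9, rng=rng)
```

With t odd, `median` returns one of the individual estimates, each of which is a cell count scaled by a power of two.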


In the second version of ApproxMC [8], two key algorithmic improvements 
are proposed to improve the practical performance by reducing the number of 
calls to the SAT solver. The first improvement is using galloping search to more 
efficiently find the correct number of cells, i.e., LogSATSearch at Line 3 of Algo- 
rithm 2. The second is using linear search over a small interval around the 
previous value of m before resorting to the galloping search. Additionally, the 
third and fourth versions [22,23] enhance the algorithm's performance by effectively 
handling CNF formulas conjoined with XOR constraints, which are commonly 
used in the hashing-based counting framework. Moreover, an effective prepro- 
cessor named Arjun [24] is proposed to enhance ApproxMC’s performance by 
constructing shorter XOR constraints. As a result, the combination of Arjun and 
ApproxMC4 solved almost all existing benchmarks [24], making it the current 
state of the art in this field. 

In this work, we address the main limitation of the ApproxMC algorithm 
by focusing on an aspect that previous developments have left untouched: 
the core algorithm of ApproxMC, which has remained unchanged. 


4 Weakness of ApproxMC 


As noted above, the core algorithm of ApproxMC has not changed since 2016, 
and in this work, we aim to address the core limitation of ApproxMC. To put our 
contribution in context, we first review ApproxMC and its core algorithm, called 
ApproxMCCore. We present the pseudocode of ApproxMC and ApproxMCCore in 
Algorithms 1 and 2, respectively. ApproxMCCore may return an estimate that 
falls outside the PAC range $\left[\frac{|\mathsf{sol}(F)|}{1+\varepsilon},\ (1+\varepsilon)|\mathsf{sol}(F)|\right]$ with a certain probability of 
error. Therefore, ApproxMC repeatedly invokes ApproxMCCore (Lines 5-9) and 
returns the median of the estimates returned by ApproxMCCore (Line 10), which 
reduces the error probability to the user-provided parameter δ. 

Let $\mathsf{Error}_t$ denote the event that the median of t estimates falls outside 
$\left[\frac{|\mathsf{sol}(F)|}{1+\varepsilon},\ (1+\varepsilon)|\mathsf{sol}(F)|\right]$. Let L denote the event that an invocation of 
ApproxMCCore returns an estimate less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$. Similarly, let U denote the 
event that an individual estimate of |sol(F)| is greater than $(1+\varepsilon)|\mathsf{sol}(F)|$. For 
simplicity of exposition, we assume t is odd; the current implementation indeed 
ensures that t is odd by choosing the smallest odd t for which $\Pr[\mathsf{Error}_t] \le \delta$. 

In the remainder of the section, we will demonstrate that reducing 
max{Pr[L], Pr[U]} can effectively reduce the number of repetitions t, making 
the small-δ scenarios practical. To this end, we will first demonstrate that the 
existing analysis technique of ApproxMC leads to loose bounds on $\Pr[\mathsf{Error}_t]$. We 
then present a new analysis that leads to tighter bounds on $\Pr[\mathsf{Error}_t]$. 

The existing combinatorial analysis in [7] derives the following proposition: 


Proposition 3. 

$$\Pr[\mathsf{Error}_t] \le \eta(t, \lceil t/2 \rceil, \Pr[L \cup U])$$

where $\eta(t, m, p) = \sum_{k=m}^{t} \binom{t}{k} p^k (1-p)^{t-k}$. 


Proposition 3 follows from the observation that if the median falls outside 
the PAC range, at least ⌈t/2⌉ of the results must also be outside the range. 
Requiring $\eta(t, \lceil t/2 \rceil, \Pr[L \cup U]) \le \delta$, we can compute a valid t at Line 4 of ApproxMC. 

Proposition 3 raises a question: can we derive a tight upper bound for 
$\Pr[\mathsf{Error}_t]$? The following lemma provides an affirmative answer to this 
question. 


Lemma 2. Assuming t is odd, we have: 

$$\Pr[\mathsf{Error}_t] = \eta(t, \lceil t/2 \rceil, \Pr[L]) + \eta(t, \lceil t/2 \rceil, \Pr[U])$$


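Proof. Let $I_i^L$ be an indicator variable that is 1 when ApproxMCCore returns a 
nSols less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$, indicating the occurrence of event L in the i-th repetition. 
Let $I_i^U$ be an indicator variable that is 1 when ApproxMCCore returns a nSols 
greater than $(1+\varepsilon)|\mathsf{sol}(F)|$, indicating the occurrence of event U in the i-th 
repetition. We aim first to prove that 

$$\mathsf{Error}_t \Leftrightarrow \left(\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil\right) \vee \left(\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil\right).$$

We will begin by proving the right (⇒) implication. If the median of t estimates 
violates the PAC guarantee, the median is either less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$ or 
greater than $(1+\varepsilon)|\mathsf{sol}(F)|$. In the first case, since at least ⌈t/2⌉ of the estimates 
are no greater than the median, at least ⌈t/2⌉ estimates are less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$. Formally, this 
implies $\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil$. Similarly, in the case that the median is greater than 
$(1+\varepsilon)|\mathsf{sol}(F)|$, since at least ⌈t/2⌉ of the estimates are no less than the median, at least ⌈t/2⌉ 
estimates are greater than $(1+\varepsilon)|\mathsf{sol}(F)|$, thus formally implying $\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil$. 

On the other hand, we prove the left (⇐) implication. Given $\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil$, 
more than half of the estimates are less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$, and therefore the median is 
less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$, violating the PAC guarantee. Similarly, given $\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil$, 
more than half of the estimates are greater than $(1+\varepsilon)|\mathsf{sol}(F)|$, and therefore 
the median is greater than $(1+\varepsilon)|\mathsf{sol}(F)|$, violating the PAC guarantee. This 
concludes the proof of the equivalence. Then we obtain: 

$$\Pr[\mathsf{Error}_t] = \Pr\left[\left(\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil\right) \vee \left(\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil\right)\right]$$

$$= \Pr\left[\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil\right] + \Pr\left[\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil\right] - \Pr\left[\left(\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil\right) \wedge \left(\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil\right)\right]$$

Given $I_i^L + I_i^U \le 1$ for i = 1, 2, ..., t, we have $\sum_{i=1}^{t} (I_i^L + I_i^U) \le t$. If 
$\left(\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil\right) \wedge \left(\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil\right)$ also held, then, since t is odd, we would obtain 
$\sum_{i=1}^{t} (I_i^L + I_i^U) \ge t + 1$, contradicting $\sum_{i=1}^{t} (I_i^L + I_i^U) \le t$. Hence, we can conclude that 
$\Pr\left[\left(\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil\right) \wedge \left(\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil\right)\right] = 0$. From this, we can deduce: 

$$\Pr[\mathsf{Error}_t] = \Pr\left[\sum_{i=1}^{t} I_i^L \ge \lceil t/2 \rceil\right] + \Pr\left[\sum_{i=1}^{t} I_i^U \ge \lceil t/2 \rceil\right] = \eta(t, \lceil t/2 \rceil, \Pr[L]) + \eta(t, \lceil t/2 \rceil, \Pr[U])$$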

Though Lemma 2 shows that reducing Pr[L] and Pr[U] can decrease the error 
probability, it is still uncertain to what extent Pr[L] and Pr[U] affect the error 
probability. To further understand this impact, the following lemma relates 
the error probability to t as a function of Pr[L] 
and Pr[U]. 


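Lemma 3. Let $p_{max} = \max\{\Pr[L], \Pr[U]\}$ and $p_{max} < 0.5$, we have 

$$\Pr[\mathsf{Error}_t] \in O\left(t^{-\frac{1}{2}} \left(2\sqrt{p_{max}(1-p_{max})}\right)^{t}\right)$$

Proof. Applying Lemmas 1 and 2, we have 

$$\Pr[\mathsf{Error}_t] \in \Theta\left(t^{-\frac{1}{2}} \left(\left(2\sqrt{\Pr[L](1-\Pr[L])}\right)^{t} + \left(2\sqrt{\Pr[U](1-\Pr[U])}\right)^{t}\right)\right) = O\left(t^{-\frac{1}{2}} \left(2\sqrt{p_{max}(1-p_{max})}\right)^{t}\right)$$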


In summary, Lemma 3 provides a way to tighten the bound on $\Pr[\mathsf{Error}_t]$ 
by designing an algorithm such that we can obtain a tighter bound on $p_{max}$, 
in contrast to previous approaches that relied on obtaining a tighter bound on 
Pr[L ∪ U]. 


5 Rounding Model Counting 


In this section, we present a rounding-based technique that allows us to obtain 
a tighter bound on $p_{max}$. At a high level, instead of returning the estimate from 
one iteration of the underlying core algorithm as the number of solutions in a 
randomly chosen cell multiplied by the number of cells, we round each estimate of 
the model count to a value that is more likely to be within the (1 + ε)-bound. While 
counter-intuitive at first glance, we show that rounding the estimate reduces 
max{Pr[L], Pr[U]}, thereby resulting in a smaller number of repetitions of the 
underlying algorithm. 

We present ApproxMC6, a rounding-based approximate model counting algorithm, 
in Sect. 5.1. Section 5.2 will demonstrate how ApproxMC6 decreases 
max{Pr[L], Pr[U]} and the number of estimates. Lastly, in Sect. 5.3, we will 
prove the theoretical correctness of the algorithm. 


5.1 Algorithm 


Algorithm 3 presents the procedure of ApproxMC6. ApproxMC6 takes as 
input a formula F, a tolerance parameter ε, and a confidence parameter 
δ. ApproxMC6 returns an (ε, δ)-estimate c of |sol(F)| such that 
$\Pr\left[\frac{|\mathsf{sol}(F)|}{1+\varepsilon} \le c \le (1+\varepsilon)|\mathsf{sol}(F)|\right] \ge 1-\delta$. ApproxMC6 is identical to ApproxMC in 
its initialization of data structures and handling of base cases (Lines 1-4). 

In Line 5, we pre-compute the rounding type and rounding value to be 
used in ApproxMC6Core. configRound is implemented in Algorithm 5; the precise 
choices arise due to technical analysis, as presented in Sect. 5.2. Note that, in 
configRound, $\mathsf{Cnt}_{\langle F,m\rangle}$ is rounded up to roundValue for ε < 3 (roundUp = 1) but 
rounded to roundValue for ε ≥ 3 (roundUp = 0). Rounding up means we assign 
roundValue to $\mathsf{Cnt}_{\langle F,m\rangle}$ if $\mathsf{Cnt}_{\langle F,m\rangle}$ is less than roundValue and, otherwise, keep 
$\mathsf{Cnt}_{\langle F,m\rangle}$ unchanged. Rounding means that we assign roundValue to $\mathsf{Cnt}_{\langle F,m\rangle}$ in 
all cases. ApproxMC6 computes the number of repetitions necessary to lower the error 
probability down to δ at Line 6. The implementation of computeIter is presented 
in Algorithm 6 following Lemma 2. The iterator keeps increasing until the tight 
error bound is no more than δ. As we will show in Sect. 5.2, Pr[L] and Pr[U] 
depend on ε. In the loop of Lines 7-11, ApproxMC6Core repeatedly estimates 
|sol(F)|. Each estimate nSols is stored in List C, and the median of C serves as 
the final estimate satisfying the (ε, δ)-guarantee. 

Algorithm 3. ApproxMC6(F, ε, δ) 
1: thresh ← $9.84\left(1+\frac{\varepsilon}{1+\varepsilon}\right)\left(1+\frac{1}{\varepsilon}\right)^{2}$; 
2: Y ← BoundedSAT(F, thresh); 
3: if (|Y| < thresh) then return |Y|; 
4: C ← emptyList; iter ← 0; 
5: (roundUp, roundValue) ← configRound(ε) 
6: t ← computeIter(ε, δ) 
7: repeat 
8:     iter ← iter + 1; 
9:     nSols ← ApproxMC6Core(F, thresh, roundUp, roundValue); 
10:     AddToList(C, nSols); 
11: until (iter ≥ t); 
12: finalEstimate ← FindMedian(C); 
13: return finalEstimate; 

Algorithm 4 shows the pseudo-code of ApproxMC6Core. A random hash function 
is chosen at Line 1 to partition sol(F) into roughly equal cells. A random 
hash value is chosen at Line 2 to randomly pick a cell for estimation. In Line 3, 
we search for a value m such that the cell picked from $2^m$ available cells is small 
enough to enumerate solutions one by one while providing a good estimate of 
|sol(F)|. In Line 4, bounded model counting is invoked to compute the size of the 
picked cell, i.e., $\mathsf{Cnt}_{\langle F,m\rangle}$. Finally, if roundUp equals 1, $\mathsf{Cnt}_{\langle F,m\rangle}$ is rounded up to 
roundValue at Line 6. Otherwise, roundUp equals 0, and $\mathsf{Cnt}_{\langle F,m\rangle}$ is rounded to 
roundValue at Line 8. Note that rounding up returns roundValue only if $\mathsf{Cnt}_{\langle F,m\rangle}$ 
is less than roundValue. However, in the case of rounding, roundValue is always 
returned no matter what value $\mathsf{Cnt}_{\langle F,m\rangle}$ is. 

For large ε (ε ≥ 3), ApproxMC6Core returns a value that is independent of 
the value returned by BoundedSAT in line 4 of Algorithm 4. However, observe that 
the value depends on m returned by LogSATSearch [8], which in turn uses 
BoundedSAT to find the value of m; therefore, the algorithm's run is not independent 
of all the calls to BoundedSAT. The technical reason for correctness stems 
from the observation that for large values of ε, we can always find a value of m 
such that $2^m \times c$ (where c is a constant) is a (1 + ε)-approximation of |sol(F)|. As an 
example, consider n = 7 and let c = 1; then a (1 + 3)-approximation of a number 
between 1 and 128 belongs to [1, 2, 4, 8, 16, 32, 64, 128]; therefore, returning an 
answer of the form $c \times 2^m$ suffices as long as we are able to search for the right 
value of m, which is accomplished by LogSATSearch. We could skip the final call 
to BoundedSAT in line 4 of ApproxMC6Core for large values of ε, but that 
BoundedSAT computation is already carried out as part of LogSATSearch. 
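
The example with n = 7 and c = 1 can be verified by brute force. The short sketch below (the helper name `has_power_of_two_approx` is hypothetical) checks, for each count between 1 and 128, whether some power of two $2^m$ with 0 ≤ m ≤ n lies within the (1 + ε)-range of that count.

```python
def has_power_of_two_approx(count, eps, n):
    """True if some 2^m with 0 <= m <= n lies in [count/(1+eps), count*(1+eps)]."""
    lower, upper = count / (1 + eps), count * (1 + eps)
    return any(lower <= 2 ** m <= upper for m in range(n + 1))


# Every count between 1 and 128 admits a power-of-two (1+3)-approximation.
all_covered = all(has_power_of_two_approx(c, 3, 7) for c in range(1, 129))

# For a small tolerance this fails, e.g. no power of two lies in [45/1.4, 45*1.4].
small_eps_gap = has_power_of_two_approx(45, 0.4, 7)
```

The second check illustrates why a count-independent return value is only used for large ε: for small tolerances, powers of two alone are too sparse.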


142 J. Yang and K. S. Meel 


Algorithm 4. ApproxMC6Core(F, thresh, roundUp, roundValue) 
1: Choose h at random from H(n, n); 
2: Choose α at random from $\{0,1\}^n$; 
3: m ← LogSATSearch(F, h, α, thresh); 
4: $\mathsf{Cnt}_{\langle F,m\rangle}$ ← BoundedSAT($F \wedge \left(h^{(m)}\right)^{-1}\left(\alpha^{(m)}\right)$, thresh); 
5: if roundUp = 1 then 
6:     return ($2^m \times \max\{\mathsf{Cnt}_{\langle F,m\rangle}, \text{roundValue}\}$); 
7: else 
8:     return ($2^m \times$ roundValue); 


Algorithm 5. configRound(ε) 
1: if (ε < √2 − 1) then return (1, $\frac{\sqrt{1+2\varepsilon}}{2}$ pivot); 
2: else if (ε < 1) then return (1, $\frac{\text{pivot}}{\sqrt{2}}$); 
3: else if (ε < 3) then return (1, pivot); 
4: else if (ε < 4√2 − 1) then return (0, pivot); 
5: else 
6:     return (0, √2 pivot); 
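
The rounding step of Algorithm 4 and the case split of Algorithm 5 can be written down directly. The sketch below is a hedged transcription rather than the reference implementation: the function names are hypothetical, pivot = 9.84 (1 + 1/ε)² is computed inside `config_round`, and the round values (in particular the factor in the first branch, which is hard to read in the source) are assumptions based on our reading of Algorithm 5.

```python
from math import sqrt


def config_round(eps):
    """Return (round_up, round_value) following the case split of Algorithm 5."""
    pivot = 9.84 * (1 + 1 / eps) ** 2
    if eps < sqrt(2) - 1:
        return 1, sqrt(1 + 2 * eps) / 2 * pivot   # assumed reading of line 1
    if eps < 1:
        return 1, pivot / sqrt(2)
    if eps < 3:
        return 1, pivot
    if eps < 4 * sqrt(2) - 1:
        return 0, pivot
    return 0, sqrt(2) * pivot


def rounded_estimate(cnt, m, round_up, round_value):
    """Lines 5-8 of Algorithm 4: round the cell count, then scale by 2^m."""
    if round_up == 1:
        return 2 ** m * max(cnt, round_value)     # rounding up
    return 2 ** m * round_value                   # rounding: cnt is ignored


round_up, round_value = config_round(0.8)
low = rounded_estimate(10, 3, 1, 35.0)    # 10 is raised to 35.0 before scaling
high = rounded_estimate(50, 3, 1, 35.0)   # 50 already exceeds the round value
fixed = rounded_estimate(50, 3, 0, 35.0)  # plain rounding returns 8 * 35.0
```

Rounding up only ever raises small cell counts, which targets under-estimation (event L), whereas plain rounding makes the result depend on m alone.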


5.2 Repetition Reduction 


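We will now show that ApproxMC6Core allows us to obtain a smaller 
max{Pr[L], Pr[U]}. Furthermore, we show the large gap between the error 
probability of ApproxMC6 and that of ApproxMC both analytically and visually. 
The following lemma presents the upper bounds of Pr[L] and Pr[U] for 
ApproxMC6Core. Let pivot = $9.84\left(1+\frac{1}{\varepsilon}\right)^{2}$ for simplicity. 

Lemma 4. The following bounds hold for ApproxMC6: 

$$\Pr[L] \le \begin{cases} 0.262 & \text{if } \varepsilon < \sqrt{2}-1 \\ 0.157 & \text{if } \sqrt{2}-1 \le \varepsilon < 1 \\ 0.085 & \text{if } 1 \le \varepsilon < 3 \\ 0.055 & \text{if } 3 \le \varepsilon < 4\sqrt{2}-1 \\ 0.023 & \text{if } \varepsilon \ge 4\sqrt{2}-1 \end{cases}$$

$$\Pr[U] \le \begin{cases} 0.169 & \text{if } \varepsilon < 3 \\ 0.044 & \text{if } \varepsilon \ge 3 \end{cases}$$
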
The proof of Lemma 4 is deferred to Sect. 5.3. Observe that Lemma 4 influences 
the choices in the design of configRound (Algorithm 5). Recall that 
max{Pr[L], Pr[U]} ≤ 0.36 for ApproxMC (Appendix C), but Lemma 4 ensures 
max{Pr[L], Pr[U]} ≤ 0.262 for ApproxMC6. For ε ≥ 4√2 − 1, Lemma 4 even 
delivers max{Pr[L], Pr[U]} ≤ 0.044. 

Algorithm 6. computeIter(ε, δ) 
1: iter ← 1; 
2: while ($\eta(\text{iter}, \lceil \text{iter}/2 \rceil, \Pr_{\varepsilon}[L]) + \eta(\text{iter}, \lceil \text{iter}/2 \rceil, \Pr_{\varepsilon}[U]) > \delta$) do 
3:     iter ← iter + 2; 
4: return iter; 
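
Algorithm 6 is short enough to execute directly. The following sketch assumes that the bounds of Lemma 4 are plugged in for Pr[L] and Pr[U] (passed here as plain numbers `p_l` and `p_u`), and the function names are hypothetical.

```python
from math import comb


def eta(t, m, p):
    """Upper tail of Binomial(t, p): probability of at least m successes."""
    return sum(comb(t, k) * p**k * (1 - p)**(t - k) for k in range(m, t + 1))


def compute_iter(p_l, p_u, delta):
    """Smallest odd number of repetitions whose tight error bound is at most delta."""
    it = 1
    while eta(it, (it + 1) // 2, p_l) + eta(it, (it + 1) // 2, p_u) > delta:
        it += 2
    return it


# Bounds of Lemma 4 for sqrt(2) - 1 <= eps < 1, e.g. eps = 0.8.
reps_delta_01 = compute_iter(0.157, 0.169, 0.1)
reps_delta_0001 = compute_iter(0.157, 0.169, 0.001)
# Bounds of Lemma 4 for 1 <= eps < 3 and for eps >= 4*sqrt(2) - 1.
reps_mid_eps = compute_iter(0.085, 0.169, 0.001)
reps_large_eps = compute_iter(0.023, 0.044, 0.001)
```

Under these assumptions the sketch yields 5 and 19 repetitions for ε = 0.8 at δ = 0.1 and δ = 0.001, and 17 and 5 repetitions for the other two cases at δ = 0.001, in line with the figures quoted in the text.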


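The following theorem analytically presents the gap between the error 
probability of ApproxMC6 and that of ApproxMC.¹ 

Theorem 1. For √2 − 1 ≤ ε < 1, 

$$\Pr[\mathsf{Error}_t] \in \begin{cases} O\left(t^{-\frac{1}{2}}\, 0.75^{t}\right) & \text{for ApproxMC6} \\ O\left(t^{-\frac{1}{2}}\, 0.96^{t}\right) & \text{for ApproxMC} \end{cases}$$

Proof. From Lemma 4, we obtain $p_{max} \le 0.169$ for ApproxMC6. Applying 
Lemma 3, we have 

$$\Pr[\mathsf{Error}_t] \in O\left(t^{-\frac{1}{2}} \left(2\sqrt{0.169(1-0.169)}\right)^{t}\right) \subseteq O\left(t^{-\frac{1}{2}}\, 0.75^{t}\right)$$

For ApproxMC, combining $p_{max} \le 0.36$ (Appendix C) and Lemma 3, we obtain 

$$\Pr[\mathsf{Error}_t] \in O\left(t^{-\frac{1}{2}} \left(2\sqrt{0.36(1-0.36)}\right)^{t}\right) = O\left(t^{-\frac{1}{2}}\, 0.96^{t}\right)$$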
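
For reference, the two bases appearing in Theorem 1 follow from a short calculation, written out here as a worked evaluation:

```latex
\begin{align*}
2\sqrt{0.169\,(1-0.169)} &= 2\sqrt{0.169 \cdot 0.831} = 2\sqrt{0.140439} \approx 0.7495 < 0.75, \\
2\sqrt{0.36\,(1-0.36)}   &= 2\sqrt{0.36 \cdot 0.64} = 2\sqrt{0.2304} = 2 \cdot 0.48 = 0.96.
\end{align*}
```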


Figure 1 visualizes the large gap between the error probability of ApproxMC6 
and that of ApproxMC. The x-axis represents the number of repetitions (t) in 
ApproxMC6 or ApproxMC. The y-axis represents the upper bound of the error 
probability in log scale. For example, at t = 117, ApproxMC guarantees that 
the median over 117 estimates violates the PAC guarantee with probability at most $10^{-3}$. 
However, ApproxMC6 allows a much smaller error probability that is at most 
$10^{-15}$ for √2 − 1 ≤ ε < 1. The smaller error probability enables ApproxMC6 
to perform fewer repetitions while providing the same level of theoretical 
guarantee. For example, given δ = 0.001 to ApproxMC, i.e., y = 0.001 in Fig. 1, 
ApproxMC requires 117 repetitions to obtain the given error probability. However, 
for ApproxMC6, 37 repetitions for ε < √2 − 1, 19 repetitions for 
√2 − 1 ≤ ε < 1, 17 repetitions for 1 ≤ ε < 3, 7 repetitions for 3 ≤ ε < 4√2 − 1, 
and 5 repetitions for ε ≥ 4√2 − 1 are sufficient to obtain the same level of error 
probability. Consequently, ApproxMC6 can obtain 3×, 6×, 7×, 17×, and 23× 
speedups, respectively, over ApproxMC. 


¹ We state the result for the case √2 − 1 ≤ ε < 1. A similar analysis can be applied to 
other cases, which leads to an even bigger gap between ApproxMC6 and ApproxMC. 

Fig. 1. Comparison of error bounds for ApproxMC6 and ApproxMC. 


5.3 Proof of Lemma 4 for Case √2 − 1 ≤ ε < 1 


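We provide the full proof of Lemma 4 for the case √2 − 1 ≤ ε < 1. We defer the proof 
of other cases to Appendix D. 
Let $T_m$ denote the event $\left(\mathsf{Cnt}_{\langle F,m\rangle} < \text{thresh}\right)$, and let $L_m$ and $U_m$ denote the 
events $\left(\mathsf{Cnt}_{\langle F,m\rangle} < \frac{\mathsf{E}[\mathsf{Cnt}_{\langle F,m\rangle}]}{1+\varepsilon}\right)$ and $\left(\mathsf{Cnt}_{\langle F,m\rangle} > \mathsf{E}[\mathsf{Cnt}_{\langle F,m\rangle}](1+\varepsilon)\right)$, respectively. 
To ease the proof, let $U'_m$ denote $\left(\mathsf{Cnt}_{\langle F,m\rangle} > \mathsf{E}[\mathsf{Cnt}_{\langle F,m\rangle}]\left(1+\frac{\varepsilon}{1+\varepsilon}\right)\right)$, 
and thereby $U_m \subseteq U'_m$. Let $m^* = \lfloor \log_2 |\mathsf{sol}(F)| - \log_2(\text{pivot}) + 1 \rfloor$ such that $m^*$ 
is the smallest m satisfying $\frac{|\mathsf{sol}(F)|}{2^m}\left(1+\frac{\varepsilon}{1+\varepsilon}\right) \le \text{thresh} - 1$. 
Let us first prove the lemmas used in the proof of Lemma 4. 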


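Lemma 5. For every $0 < \beta < 1$, $\gamma > 1$, and $1 \le m \le n$, the following holds:


1. $\Pr\left[\mathsf{Cnt}_{\langle F,m\rangle} \le \beta\, E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]\right] \le \frac{1}{1+(1-\beta)^2\, E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]}$
2. $\Pr\left[\mathsf{Cnt}_{\langle F,m\rangle} \ge \gamma\, E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]\right] \le \frac{1}{1+(\gamma-1)^2\, E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]}$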
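Proof. Statement 1 can be proved following the proof of Lemma 1 in [8]. For
statement 2, we rewrite the left-hand side and apply Cantelli's inequality:

\[
\Pr\left[\mathsf{Cnt}_{\langle F,m\rangle} - E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] \ge (\gamma-1)\, E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]\right] \le \frac{\sigma^2\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]}{\sigma^2\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] + \left((\gamma-1)\, E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]\right)^2}
\]

Finally, applying Eq. 2 completes the proof.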


Lemma 6. Given $\sqrt{2}-1 \le \varepsilon < 1$, the following bounds hold:

Proof. Following the proof of Lemma 2 in [8], we can prove statements 1, 2, and
3. To prove statement 4, replacing $\gamma$ with $\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$ in Lemma 5 and employing
$E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \mathsf{pivot}/2$, we obtain

\[
\Pr\left[U'_{m^*}\right] \le \frac{1}{1+\left(\frac{\varepsilon}{1+\varepsilon}\right)^2 \mathsf{pivot}/2} \le \frac{1}{5.92}
\]
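The two tail bounds of Lemma 5 are simple enough to evaluate numerically. The sketch below does so in Python and checks the constant $1/5.92$ obtained above. It assumes $\mathsf{pivot} = 9.84\,(1+1/\varepsilon)^2$, which is the value substituted for pivot in the appendix computations; the helper names are hypothetical.

```python
def lower_tail_bound(beta: float, mean: float) -> float:
    """Lemma 5, statement 1: bound on Pr[Cnt <= beta * E[Cnt]], 0 < beta < 1."""
    return 1.0 / (1.0 + (1.0 - beta) ** 2 * mean)

def upper_tail_bound(gamma: float, mean: float) -> float:
    """Lemma 5, statement 2: bound on Pr[Cnt >= gamma * E[Cnt]], gamma > 1."""
    return 1.0 / (1.0 + (gamma - 1.0) ** 2 * mean)

def pivot(eps: float) -> float:
    # Assumed definition, read off the substitutions used in Appendix D.
    return 9.84 * (1.0 + 1.0 / eps) ** 2

eps = 0.8
# Bound on Pr[U'_{m*}] with gamma = 1 + eps/(1+eps) and E[Cnt] >= pivot/2.
u_bound = upper_tail_bound(1.0 + eps / (1.0 + eps), pivot(eps) / 2.0)
# Bound on Pr[L_{m*-1}] with beta = 1/(1+eps) and E[Cnt] >= pivot.
l_bound = lower_tail_bound(1.0 / (1.0 + eps), pivot(eps))

print(1.0 / u_bound, 1.0 / l_bound)
```

With this choice of pivot, the factor $(\varepsilon/(1+\varepsilon))^2 (1+1/\varepsilon)^2$ cancels to 1, so both bounds are independent of $\varepsilon$.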


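Now we prove the upper bounds of $\Pr[L]$ and $\Pr[U]$ in Lemma 4 for
$\sqrt{2}-1 \le \varepsilon < 1$. The proof for other $\varepsilon$ is deferred to Appendix D due to the page limit.

Lemma 4. The following bounds hold for ApproxMC6:

\[
\Pr[L] \le
\begin{cases}
0.262 & \text{if } \varepsilon < \sqrt{2}-1 \\
0.157 & \text{if } \sqrt{2}-1 \le \varepsilon < 1 \\
0.085 & \text{if } 1 \le \varepsilon < 3 \\
0.055 & \text{if } 3 \le \varepsilon < 4\sqrt{2}-1 \\
0.023 & \text{if } \varepsilon \ge 4\sqrt{2}-1
\end{cases}
\]

\[
\Pr[U] \le
\begin{cases}
0.169 & \text{if } \varepsilon < 3 \\
0.044 & \text{if } \varepsilon \ge 3
\end{cases}
\]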


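Proof. We prove the case of $\sqrt{2}-1 \le \varepsilon < 1$. The proof for other $\varepsilon$ is deferred to
Appendix D. Let us first bound $\Pr[L]$. Following LogSATSearch in [8], we have

\[
\Pr[L] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)\right] \tag{3}
\]

Equation 3 can be simplified by three observations labeled O1, O2 and O3 below.

O1: $\forall i \le m^*-3$, $T_i \subseteq T_{i+1}$. Therefore,

\[
\bigcup_{i \in \{1,\ldots,m^*-3\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-3\}} T_i \subseteq T_{m^*-3}
\]

O2: For $i \in \{m^*-2, m^*-1\}$, we have

\[
\bigcup_{i \in \{m^*-2,\, m^*-1\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq L_{m^*-2} \cup L_{m^*-1}
\]

O3: $\forall i \ge m^*$, since we round $\mathsf{Cnt}_{\langle F,i\rangle}$ up to $\frac{\mathsf{pivot}}{\sqrt{2}}$ and $m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$,
we have $2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*} \times \frac{\mathsf{pivot}}{\sqrt{2}} \ge \frac{|\mathsf{sol}(F)|}{\sqrt{2}} \ge \frac{|\mathsf{sol}(F)|}{1+\varepsilon}$. The last
inequality follows from $\varepsilon \ge \sqrt{2}-1$. Then we have $\mathsf{Cnt}_{\langle F,i\rangle} \ge \frac{E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]}{1+\varepsilon}$.
Therefore, $L_i = \emptyset$ for $i \ge m^*$ and we have

\[
\bigcup_{i \in \{m^*,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) = \emptyset
\]

Following the observations O1, O2, and O3, we simplify Eq. 3 and obtain

\[
\Pr[L] \le \Pr\left[T_{m^*-3}\right] + \Pr\left[L_{m^*-2}\right] + \Pr\left[L_{m^*-1}\right]
\]

Employing Lemma 6 gives $\Pr[L] \le 0.157$.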
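Now let us bound $\Pr[U]$. Similarly, following LogSATSearch in [8], we have

\[
\Pr[U] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right)\right] \tag{4}
\]

We derive the following observations O4 and O5.

O4: $\forall i \le m^*-1$, since $m^* \le \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot}) + 1$, the event $T_i$ gives
$2^i \times \mathsf{Cnt}_{\langle F,i\rangle} < 2^{m^*-1} \times \mathsf{thresh} \le |\mathsf{sol}(F)|\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. Then we obtain
$\mathsf{Cnt}_{\langle F,i\rangle} < E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. Therefore, $T_i \cap U'_i = \emptyset$ for $i \le m^*-1$ and we have

\[
\bigcup_{i \in \{1,\ldots,m^*-1\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-1\}} \left(\overline{T_{i-1}} \cap T_i \cap U'_i\right) = \emptyset
\]

O5: $\forall i \ge m^*$, $\overline{T_i}$ implies $\mathsf{Cnt}_{\langle F,i\rangle} \ge \mathsf{thresh}$, and then we have
$2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*} \times \mathsf{thresh} > |\mathsf{sol}(F)|\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. The second inequality follows from
$m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$. Then we obtain $\mathsf{Cnt}_{\langle F,i\rangle} > E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$.
Therefore, $\overline{T_i} \subseteq U'_i$ for $i \ge m^*$. Since $\forall i$, $\overline{T_i} \subseteq \overline{T_{i-1}}$, we have

\[
\begin{aligned}
\bigcup_{i \in \{m^*,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right)
&\subseteq \bigcup_{i \in \{m^*+1,\ldots,n\}} \overline{T_{i-1}} \cup \left(\overline{T_{m^*-1}} \cap T_{m^*} \cap U_{m^*}\right) \\
&\subseteq \overline{T_{m^*}} \cup \left(\overline{T_{m^*-1}} \cap T_{m^*} \cap U_{m^*}\right) \\
&\subseteq \overline{T_{m^*}} \cup U_{m^*} \\
&\subseteq U'_{m^*}
\end{aligned}
\tag{5}
\]

Remark that for $\sqrt{2}-1 \le \varepsilon < 1$, we round $\mathsf{Cnt}_{\langle F,m^*\rangle}$ up to $\frac{\mathsf{pivot}}{\sqrt{2}}$ and we
have $2^{m^*} \times \frac{\mathsf{pivot}}{\sqrt{2}} \le |\mathsf{sol}(F)|(1+\varepsilon)$, which means rounding doesn't affect the
event $U_{m^*}$; therefore, Inequality 5 still holds.


Following the observations O4 and O5, we simplify Eq. 4 and obtain

\[
\Pr[U] \le \Pr\left[U'_{m^*}\right]
\]

Employing Lemma 6 gives $\Pr[U] \le 0.169$.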


The breakpoints in $\varepsilon$ of Lemma 4 arise from how we use rounding to lower
the error probability for events $L$ and $U$. Rounding up counts can lower $\Pr[L]$
but may increase $\Pr[U]$. Therefore, we want to round up counts to a value that
doesn't affect the event $U$. Take $\sqrt{2}-1 \le \varepsilon < 1$ as an example; we round up the
count to a value such that $L_{m^*}$ becomes an empty event with zero probability
while $U_{m^*}$ remains unchanged. To make $L_{m^*}$ empty, we have

\[
2^{m^*} \times \mathsf{roundValue} \ge 2^{m^*} \times \frac{1}{1+\varepsilon}\,\mathsf{pivot} \ge \frac{1}{1+\varepsilon}\,|\mathsf{sol}(F)| \tag{6}
\]

where the last inequality follows from $m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$. To
maintain $U_{m^*}$ unchanged, we obtain

\[
2^{m^*} \times \mathsf{roundValue} \le 2^{m^*} \times \frac{1+\varepsilon}{2}\,\mathsf{pivot} \le (1+\varepsilon)\,|\mathsf{sol}(F)| \tag{7}
\]

where the last inequality follows from $m^* \le \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot}) + 1$.
Combining Eqs. 6 and 7 together, we obtain

\[
2^{m^*} \times \frac{1}{1+\varepsilon}\,\mathsf{pivot} \le 2^{m^*} \times \frac{1+\varepsilon}{2}\,\mathsf{pivot}
\]

which gives us $\varepsilon \ge \sqrt{2}-1$. Similarly, we can derive other breakpoints.
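The breakpoint argument can be illustrated numerically. The following Python sketch computes the window of admissible rounding values implied by Eqs. 6 and 7 (in units of pivot) and checks for which $\varepsilon$ the window is nonempty. The function names are hypothetical; the sketch only restates the two inequalities and is not taken from the ApproxMC6 implementation.

```python
import math

def rounding_window(eps: float, pivot: float = 1.0):
    """Interval [lo, hi] of admissible roundValue implied by Eqs. 6 and 7."""
    lo = pivot / (1.0 + eps)        # large enough to make L_{m*} empty (Eq. 6)
    hi = (1.0 + eps) * pivot / 2.0  # small enough to leave U_{m*} unchanged (Eq. 7)
    return lo, hi

def window_is_nonempty(eps: float) -> bool:
    lo, hi = rounding_window(eps)
    return lo <= hi

breakpoint_eps = math.sqrt(2.0) - 1.0  # the two ends of the window meet here

for eps in (0.40, 0.42, 0.80):
    print(eps, rounding_window(eps), window_is_nonempty(eps))
```

At $\varepsilon = \sqrt{2}-1$ both ends of the window equal $\mathsf{pivot}/\sqrt{2}$, which lies inside the window for every larger $\varepsilon$ in the range.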


6 Experimental Evaluation 


It is perhaps worth highlighting that both ApproxMCCore and ApproxMC6Core
invoke the underlying SAT solver on identical queries; the only difference between
ApproxMC6 and ApproxMC lies in what estimate to return and how often
ApproxMCCore and ApproxMC6Core are invoked. From this viewpoint, one would
expect that theoretical improvements would also lead to improved runtime
performance. To provide further evidence, we perform extensive empirical
evaluation and compare ApproxMC6's performance against the current state-of-the-art
model counter, ApproxMC [22]. We use Arjun as a pre-processing tool. We used
the latest version of ApproxMC, called ApproxMC4; an entry based on ApproxMC4
won the Model Counting Competition 2022.

Previous comparisons of ApproxMC have been performed on a set of 1896 
instances, but the latest version of ApproxMC is able to solve almost all the 
instances when these instances are pre-processed by Arjun. Therefore, we sought 
to construct a new comprehensive set of 1890 instances derived from various 
sources, including Model Counting Competitions 2020-2022 [12, 15,16], program 
synthesis [1], quantitative control improvisation [13], quantification of software 
properties [26], and adaptive chosen ciphertext attacks [3]. As noted earlier, our 
technique extends to projected model counting, and our benchmark suite indeed 
comprises 772 projected model counting instances. 

Experiments were conducted on a high-performance computer cluster, with 
each node consisting of 2xE5-2690v3 CPUs featuring 2 x 12 real cores and 96GB 
of RAM. For each instance, a counter was run on a single core, with a time limit 
of 5000s and a memory limit of 4GB. To compare runtime performance, we use 
the PAR-2 score, a standard metric in the SAT community. Each instance is 
assigned a score that is the number of seconds it takes the corresponding tool to 
complete execution successfully. In the event of a timeout or memory out, the
score is the doubled time limit in seconds. The PAR-2 score is then calculated as
the average of all the instance scores. We also report the speedup of ApproxMC6
over ApproxMC4, calculated as the ratio of the runtime of ApproxMC4 to that of
ApproxMC6 on instances solved by both counters. We set $\delta$ to 0.001 and $\varepsilon$ to 0.8.
Specifically, we aim to address the following research questions: 


RQ 1. How does the runtime performance of ApproxMC6 compare to that of 
ApproxMC4? 

RQ 2. How does the accuracy of the counts computed by ApproxMC6 compare 
to that of the exact count? 


Summary. In summary, ApproxMC6 consistently outperforms ApproxMC4. 
Specifically, it solved 204 additional instances and reduced the PAR-2 score by 
1063s in comparison to ApproxMC4. The average speedup of ApproxMC6 over 
ApproxMC4 was 4.68. In addition, ApproxMC6 provided a high-quality approxi- 
mation with an average observed error of 0.1, much smaller than the theoretical 
error tolerance of 0.8. 


6.1 RQ1. Overall Performance 


Figure 2 compares the counting time of ApproxMC6 and ApproxMC4. The x-axis
represents the index of the instances, sorted in ascending order of runtime, and
the y-axis represents the runtime for each instance. A point (x, y) indicates that
a counter can solve x instances within y seconds. Thus, for a given time limit y,
a counter whose curve is on the right has solved more instances than a counter
on the left. It can be seen in the figure that ApproxMC6 consistently outperforms
ApproxMC4. In total, ApproxMC6 solved 204 more instances than ApproxMC4.

Table 1 provides a detailed comparison between ApproxMC6 and ApproxMC4.
The first column lists three measures of interest: the number of solved instances,
the PAR-2 score, and the speedup of ApproxMC6 over ApproxMC4. The second
and third columns show the results for ApproxMC4 and ApproxMC6, respectively.
The second column indicates that ApproxMC4 solved 998 of the 1890
instances and achieved a PAR-2 score of 4934. The third column shows that
ApproxMC6 solved 1202 instances and achieved a PAR-2 score of 3871. Compared
to ApproxMC4, ApproxMC6 thus solved 204 more instances and reduced the PAR-2
score by 1063s. The geometric mean of the speedup
of ApproxMC6 over ApproxMC4 is 4.68. This speedup was calculated only for
instances solved by both counters.
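The two metrics used in this comparison can be sketched in a few lines of Python. The sketch assumes that an unsolved instance (timeout or memory out) is recorded as `None`; the function names and the toy runtimes are illustrative and are not the scripts or data used in the experiments.

```python
import math
from typing import Optional, Sequence

def par2_score(runtimes: Sequence[Optional[float]], time_limit: float) -> float:
    """Average score; an unsolved instance (None) is charged twice the time limit."""
    scores = [t if t is not None else 2.0 * time_limit for t in runtimes]
    return sum(scores) / len(scores)

def geometric_mean_speedup(base: Sequence[Optional[float]],
                           new: Sequence[Optional[float]]) -> float:
    """Geometric mean of base/new runtime ratios over instances solved by both."""
    ratios = [b / n for b, n in zip(base, new) if b is not None and n is not None]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Toy data: three instances, 5000 s time limit, one timeout for the baseline.
baseline = [100.0, 400.0, None]
improved = [25.0, 100.0, 4000.0]

print(par2_score(baseline, 5000.0))            # (100 + 400 + 10000) / 3
print(par2_score(improved, 5000.0))            # (25 + 100 + 4000) / 3
print(geometric_mean_speedup(baseline, improved))
```

Restricting the speedup to commonly solved instances avoids mixing measured runtimes with the artificial penalty values.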


6.2 RQ2. Approximation Quality 


We used the state-of-the-art probabilistic exact model counter Ganak to compute 
the exact model count and compare it to the results of ApproxMC6. We collected 
statistics on instances solved by both Ganak and ApproxMC6. Figure 3 presents
results for a subset of instances. The x-axis represents the index of instances 
Table 1. The number of solved instances and PAR-2 score for ApproxMC6 versus
ApproxMC4 on 1890 instances. The geometric mean of the speedup of ApproxMC6 over
ApproxMC4 is also reported.

               ApproxMC4   ApproxMC6
# Solved             998        1202
PAR-2 score         4934        3871
Speedup                -        4.68

[Figure 2: runtime (s) versus instance index for ApproxMC6 and ApproxMC4]

Fig. 2. Comparison of counting times for ApproxMC6 and ApproxMC4.

Fig. 3. Comparison of approximate counts from ApproxMC6 to exact counts from
Ganak.


sorted in ascending order by the number of solutions, and the y-axis represents
the number of solutions on a log scale. Theoretically, the approximate count
from ApproxMC6 should be within the range of $|\mathsf{sol}(F)|/1.8$ and $|\mathsf{sol}(F)| \cdot 1.8$ with
probability 0.999, where $|\mathsf{sol}(F)|$ denotes the exact count returned by Ganak.
The range is indicated by the upper and lower bounds, represented by the
curves $y = |\mathsf{sol}(F)| \cdot 1.8$ and $y = |\mathsf{sol}(F)|/1.8$, respectively. Figure 3 shows
that the approximate counts from ApproxMC6 fall within the expected range
$[|\mathsf{sol}(F)|/1.8,\ |\mathsf{sol}(F)| \cdot 1.8]$ for all instances except for four points slightly above
the upper bound. These four outliers are due to a bug in the preprocessor Arjun
that probably depends on the version of the C++ compiler and will be fixed
in the future. We also calculated the observed error, which is the mean relative
difference between the approximate and exact counts in our experiments, i.e.,
$\max\{\mathsf{finalEstimate}/|\mathsf{sol}(F)| - 1,\ |\mathsf{sol}(F)|/\mathsf{finalEstimate} - 1\}$. The overall observed
error was 0.1, which is significantly smaller than the theoretical error tolerance
of 0.8.
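For concreteness, the observed error and the PAC range check described above can be written as follows. This is a minimal Python sketch with hypothetical helper names and made-up counts; it is not the evaluation script used for the experiments.

```python
def observed_error(estimate: float, exact: float) -> float:
    """max{estimate/exact - 1, exact/estimate - 1} for a single instance."""
    return max(estimate / exact - 1.0, exact / estimate - 1.0)

def within_pac_range(estimate: float, exact: float, eps: float) -> bool:
    """True if exact/(1+eps) <= estimate <= exact*(1+eps)."""
    return exact / (1.0 + eps) <= estimate <= exact * (1.0 + eps)

def mean_observed_error(pairs) -> float:
    """Mean of the per-instance observed errors over (estimate, exact) pairs."""
    errors = [observed_error(est, exact) for est, exact in pairs]
    return sum(errors) / len(errors)

# Made-up (estimate, exact) pairs for illustration.
pairs = [(110.0, 100.0), (100.0, 100.0), (50.0, 100.0)]
print([observed_error(e, x) for e, x in pairs])
print([within_pac_range(e, x, 0.8) for e, x in pairs])
```

The error measure is symmetric in over- and under-estimation, which matches the multiplicative form of the $(1+\varepsilon)$ guarantee.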


7 Conclusion 


In this paper, we addressed the scalability challenges faced by ApproxMC in
the smaller $\delta$ range. To this end, we proposed a rounding-based algorithm,
ApproxMC6, which reduces the number of estimations required by 84% while
providing the same $(\varepsilon, \delta)$-guarantees. Our empirical evaluation on 1890 instances
shows that ApproxMC6 solved 204 more instances and achieved a reduction in
PAR-2 score of 1063s. Furthermore, ApproxMC6 achieved a 4x speedup over
ApproxMC on the instances both ApproxMC6 and ApproxMC could solve.
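A Proof of Proposition 1


Proof. For all $y \in \{0,1\}^n$ and $\alpha^{(m)} \in \{0,1\}^m$, let $\gamma_{y,\alpha^{(m)}}$ be an indicator variable that
is 1 when $h^{(m)}(y) = \alpha^{(m)}$. According to the definition of strongly 2-universal
function, we obtain, for all distinct $x, y \in \{0,1\}^n$, $E\left[\gamma_{y,\alpha^{(m)}}\right] = \frac{1}{2^m}$ and
$E\left[\gamma_{x,\alpha^{(m)}} \cdot \gamma_{y,\alpha^{(m)}}\right] = \frac{1}{2^{2m}}$. To prove Eq. 1, we obtain

\[
E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] = E\left[\sum_{y \in \mathsf{sol}(F)} \gamma_{y,\alpha^{(m)}}\right] = \sum_{y \in \mathsf{sol}(F)} E\left[\gamma_{y,\alpha^{(m)}}\right] = \frac{|\mathsf{sol}(F)|}{2^m}
\]

To prove Eq. 2, we derive

\[
\begin{aligned}
E\left[\mathsf{Cnt}^2_{\langle F,m\rangle}\right]
&= E\left[\sum_{y \in \mathsf{sol}(F)} \gamma^2_{y,\alpha^{(m)}} + \sum_{x \ne y \in \mathsf{sol}(F)} \gamma_{x,\alpha^{(m)}} \cdot \gamma_{y,\alpha^{(m)}}\right] \\
&= E\left[\sum_{y \in \mathsf{sol}(F)} \gamma_{y,\alpha^{(m)}}\right] + \sum_{x \ne y \in \mathsf{sol}(F)} E\left[\gamma_{x,\alpha^{(m)}} \cdot \gamma_{y,\alpha^{(m)}}\right] \\
&= E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] + \frac{|\mathsf{sol}(F)|\,(|\mathsf{sol}(F)|-1)}{2^{2m}}
\end{aligned}
\]

Then, we obtain

\[
\begin{aligned}
\sigma^2\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]
&= E\left[\mathsf{Cnt}^2_{\langle F,m\rangle}\right] - E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]^2 \\
&= E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] + \frac{|\mathsf{sol}(F)|\,(|\mathsf{sol}(F)|-1)}{2^{2m}} - \left(\frac{|\mathsf{sol}(F)|}{2^m}\right)^2 \\
&= E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right] - \frac{|\mathsf{sol}(F)|}{2^{2m}} \\
&\le E\left[\mathsf{Cnt}_{\langle F,m\rangle}\right]
\end{aligned}
\]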
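The two identities can be checked exactly on a tiny instance by enumerating a whole hash family. The Python sketch below uses random affine maps $h(y) = Ay + b$ over GF(2), a standard strongly 2-universal family; choosing this family, the target cell $\alpha^{(m)} = 0$, and the five-element solution set are assumptions made for illustration, and the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product

def cell_counts(solutions, n, m):
    """Yield Cnt for every affine hash h(y) = A*y + b over GF(2), target cell 0."""
    for bits in product((0, 1), repeat=m * n + m):
        rows = [bits[r * n:(r + 1) * n] for r in range(m)]  # the m rows of A
        offset = bits[m * n:]                               # the vector b
        yield sum(
            all((sum(a * v for a, v in zip(rows[r], y)) + offset[r]) % 2 == 0
                for r in range(m))
            for y in solutions
        )

def mean_and_variance(solutions, n, m):
    """Exact mean and variance of Cnt over the uniformly chosen hash function."""
    counts = list(cell_counts(solutions, n, m))
    mean = Fraction(sum(counts), len(counts))
    second_moment = Fraction(sum(c * c for c in counts), len(counts))
    return mean, second_moment - mean * mean

# Five "solutions" over n = 3 variables, hashed into 2^m = 4 cells.
solutions = [(0, 0, 1), (0, 1, 1), (1, 1, 0), (1, 1, 1), (1, 0, 0)]
mean, variance = mean_and_variance(solutions, n=3, m=2)
print(mean, variance)
```

Exact rational arithmetic is used so that the comparison with $|\mathsf{sol}(F)|/2^m$ and $E[\mathsf{Cnt}] - |\mathsf{sol}(F)|/2^{2m}$ involves no floating-point tolerance.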


B Weakness of Proposition 3 


The following proposition states that Proposition 3 provides a loose upper bound
for $\Pr[\mathsf{Error}_t]$.


Proposition 4. Assuming $t$ is odd, we have:

\[
\Pr[\mathsf{Error}_t] < \eta\left(t, \lceil t/2 \rceil, \Pr[L \cup U]\right)
\]

Proof. We will now construct a case counted by $\eta(t, \lceil t/2 \rceil, \Pr[L \cup U])$ but not
contained within the event $\mathsf{Error}_t$. Let $I_i^L$ be an indicator variable that is 1
when ApproxMCCore returns an nSols less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$, indicating the occurrence
of event $L$ in the $i$-th repetition. Let $I_i^U$ be an indicator variable that is 1
when ApproxMCCore returns an nSols greater than $(1+\varepsilon)|\mathsf{sol}(F)|$, indicating the
occurrence of event $U$ in the $i$-th repetition. Consider a scenario where $I_i^L = 1$
for $i = 1, 2, \ldots, \lceil t/4 \rceil$, $I_j^U = 1$ for $j = \lceil t/4 \rceil + 1, \ldots, \lceil t/2 \rceil$, and $I_k^L = I_k^U = 0$
for $k > \lceil t/2 \rceil$. $\eta(t, \lceil t/2 \rceil, \Pr[L \cup U])$ represents $\sum_{i=1}^{t} (I_i^L \vee I_i^U) \ge \lceil t/2 \rceil$. We can
see that this case is included in $\sum_{i=1}^{t} (I_i^L \vee I_i^U) \ge \lceil t/2 \rceil$ and therefore counted
by $\eta(t, \lceil t/2 \rceil, \Pr[L \cup U])$ since there are $\lceil t/2 \rceil$ estimates outside the PAC range.
However, this case means that $\lceil t/4 \rceil$ estimates fall within the range less than $\frac{|\mathsf{sol}(F)|}{1+\varepsilon}$
and $\lceil t/2 \rceil - \lceil t/4 \rceil$ estimates fall within the range greater than $(1+\varepsilon)|\mathsf{sol}(F)|$, while the
remaining $\lfloor t/2 \rfloor$ estimates correctly fall within the range $\left[\frac{|\mathsf{sol}(F)|}{1+\varepsilon}, (1+\varepsilon)|\mathsf{sol}(F)|\right]$.
Therefore, after sorting all the estimates, ApproxMC6 returns a correct estimate
since the median falls within the PAC range $\left[\frac{|\mathsf{sol}(F)|}{1+\varepsilon}, (1+\varepsilon)|\mathsf{sol}(F)|\right]$. In other
words, this case is out of the event $\mathsf{Error}_t$. In conclusion, there is a scenario that
is out of the event $\mathsf{Error}_t$, undesirably included in expression $\sum_{i=1}^{t} (I_i^L \vee I_i^U) \ge \lceil t/2 \rceil$
and counted by $\eta(t, \lceil t/2 \rceil, \Pr[L \cup U])$, which means $\Pr[\mathsf{Error}_t]$ is strictly less than
$\eta(t, \lceil t/2 \rceil, \Pr[L \cup U])$.
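The scenario used in this proof is easy to instantiate. The following Python sketch builds it for $t = 9$ with made-up estimate values and confirms that a majority of the estimates lies outside the PAC range while the median is still inside it. The concrete numbers and function names are assumptions for illustration only.

```python
import math
import statistics

def scenario_estimates(t: int, exact: float, eps: float):
    """ceil(t/4) underestimates, ceil(t/2) - ceil(t/4) overestimates, rest exact."""
    low = math.ceil(t / 4)
    high = math.ceil(t / 2) - low
    good = t - low - high
    return ([exact / (2 * (1 + eps))] * low   # event L: below exact/(1+eps)
            + [exact * 2 * (1 + eps)] * high  # event U: above exact*(1+eps)
            + [float(exact)] * good)          # inside the PAC range

def outside_pac(x: float, exact: float, eps: float) -> bool:
    return x < exact / (1 + eps) or x > exact * (1 + eps)

t, exact, eps = 9, 1000, 0.8
estimates = scenario_estimates(t, exact, eps)
bad = sum(outside_pac(x, exact, eps) for x in estimates)
median = statistics.median(estimates)
print(bad, median)
```

Because the bad estimates are split between both sides of the range, they cannot all sit on the same side of the median, which is why counting them jointly overestimates the error probability.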


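C Proof of $p_{\max} \le 0.36$ for ApproxMC


Proof. We prove the case of $\sqrt{2}-1 \le \varepsilon < 1$. Similarly to the proof in Sect. 5.3,
we aim to bound $\Pr[L]$ by the following equation:

\[
\Pr[L] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)\right] \tag{3 revisited}
\]

which can be simplified by three observations labeled O1, O2 and O3 below.

O1: $\forall i \le m^*-3$, $T_i \subseteq T_{i+1}$. Therefore,

\[
\bigcup_{i \in \{1,\ldots,m^*-3\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-3\}} T_i \subseteq T_{m^*-3}
\]

O2: For $i \in \{m^*-2, m^*-1\}$, we have

\[
\bigcup_{i \in \{m^*-2,\, m^*-1\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq L_{m^*-2} \cup L_{m^*-1}
\]

O3: $\forall i \ge m^*$, $\overline{T_i}$ implies $\mathsf{Cnt}_{\langle F,i\rangle} \ge \mathsf{thresh}$ and then we have
$2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*} \times \mathsf{thresh} > |\mathsf{sol}(F)|\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. The second inequality
follows from $m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$. Then we obtain
$\mathsf{Cnt}_{\langle F,i\rangle} > E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. Therefore, $\overline{T_i} \subseteq U'_i$ for $i \ge m^*$. Since
$\forall i$, $\overline{T_i} \subseteq \overline{T_{i-1}}$, we have

\[
\begin{aligned}
\bigcup_{i \in \{m^*,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)
&\subseteq \bigcup_{i \in \{m^*+1,\ldots,n\}} \overline{T_{i-1}} \cup \left(\overline{T_{m^*-1}} \cap T_{m^*} \cap L_{m^*}\right) \\
&\subseteq \overline{T_{m^*}} \cup \left(\overline{T_{m^*-1}} \cap T_{m^*} \cap L_{m^*}\right) \\
&\subseteq \overline{T_{m^*}} \cup L_{m^*} \\
&\subseteq U'_{m^*} \cup L_{m^*}
\end{aligned}
\]

Following the observations O1, O2 and O3, we simplify Eq. 3 and obtain

\[
\Pr[L] \le \Pr\left[T_{m^*-3}\right] + \Pr\left[L_{m^*-2}\right] + \Pr\left[L_{m^*-1}\right] + \Pr\left[U'_{m^*} \cup L_{m^*}\right]
\]

Employing Lemma 2 in [8] gives $\Pr[L] \le 0.36$. Note that $U$ in [8] represents $U'$
of our definition.
Then, following O4 and O5 in Sect. 5.3, we obtain

\[
\Pr[U] \le \Pr\left[U'_{m^*}\right]
\]

Employing Lemma 6 gives $\Pr[U] \le 0.169$. As a result, $p_{\max} \le 0.36$.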
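D Proof of Lemma 4


We restate the lemma below and prove the statements section by section. The
proof for $\sqrt{2}-1 \le \varepsilon < 1$ has been shown in Sect. 5.3.


Lemma 4. The following bounds hold for ApproxMC6:

\[
\Pr[L] \le
\begin{cases}
0.262 & \text{if } \varepsilon < \sqrt{2}-1 \\
0.157 & \text{if } \sqrt{2}-1 \le \varepsilon < 1 \\
0.085 & \text{if } 1 \le \varepsilon < 3 \\
0.055 & \text{if } 3 \le \varepsilon < 4\sqrt{2}-1 \\
0.023 & \text{if } \varepsilon \ge 4\sqrt{2}-1
\end{cases}
\]

\[
\Pr[U] \le
\begin{cases}
0.169 & \text{if } \varepsilon < 3 \\
0.044 & \text{if } \varepsilon \ge 3
\end{cases}
\]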


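D.1 Proof of $\Pr[L] \le 0.262$ for $\varepsilon < \sqrt{2}-1$


We first consider two cases: $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] < \frac{1+\varepsilon}{2}\mathsf{thresh}$ and $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \frac{1+\varepsilon}{2}\mathsf{thresh}$,
and then merge the results to complete the proof.


Case 1: $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] < \frac{1+\varepsilon}{2}\mathsf{thresh}$

Lemma 7. Given $\varepsilon < \sqrt{2}-1$, the following bounds hold:


1. $\Pr\left[T_{m^*-2}\right] \le \frac{1}{29.67}$
2. $\Pr\left[L_{m^*-1}\right] \le \frac{1}{10.84}$


Proof. Let's first prove statement 1. For $\varepsilon < \sqrt{2}-1$, we have
$\mathsf{thresh} < \left(2 - \frac{\sqrt{2}}{2}\right)\mathsf{pivot}$ and $E\left[\mathsf{Cnt}_{\langle F,m^*-2\rangle}\right] \ge 2\,\mathsf{pivot}$. Therefore,
$\Pr\left[T_{m^*-2}\right] \le \Pr\left[\mathsf{Cnt}_{\langle F,m^*-2\rangle} \le \left(1 - \frac{\sqrt{2}}{4}\right) E\left[\mathsf{Cnt}_{\langle F,m^*-2\rangle}\right]\right]$. Finally, employing Lemma 5 with
$\beta = 1 - \frac{\sqrt{2}}{4}$, we obtain

\[
\Pr\left[T_{m^*-2}\right] \le \frac{1}{1+\left(\frac{\sqrt{2}}{4}\right)^2 \cdot 2\,\mathsf{pivot}} \le \frac{1}{1+\left(\frac{\sqrt{2}}{4}\right)^2 \cdot 2 \cdot 9.84 \cdot \left(1+\frac{1}{\sqrt{2}-1}\right)^2} \le \frac{1}{29.67}
\]

To prove statement 2, we employ Lemma 5 with $\beta = \frac{1}{1+\varepsilon}$ and
$E\left[\mathsf{Cnt}_{\langle F,m^*-1\rangle}\right] \ge \mathsf{pivot}$ to obtain

\[
\Pr\left[L_{m^*-1}\right] \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot E\left[\mathsf{Cnt}_{\langle F,m^*-1\rangle}\right]} \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot 9.84 \cdot \left(1+\frac{1}{\varepsilon}\right)^2} = \frac{1}{10.84}
\]

Then, we prove that $\Pr[L] \le 0.126$ for $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] < \frac{1+\varepsilon}{2}\mathsf{thresh}$.


Proof. We aim to bound $\Pr[L]$ by the following equation:

\[
\Pr[L] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)\right] \tag{3 revisited}
\]

which can be simplified by the three observations labeled O1, O2 and O3 below.

O1: $\forall i \le m^*-2$, $T_i \subseteq T_{i+1}$. Therefore,

\[
\bigcup_{i \in \{1,\ldots,m^*-2\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-2\}} T_i \subseteq T_{m^*-2}
\]

O2: For $i = m^*-1$, we have

\[
\overline{T_{m^*-2}} \cap T_{m^*-1} \cap L_{m^*-1} \subseteq L_{m^*-1}
\]

O3: $\forall i \ge m^*$, since we round $\mathsf{Cnt}_{\langle F,i\rangle}$ up to $\frac{\sqrt{1+2\varepsilon}}{2}\mathsf{pivot}$, we have
$\mathsf{Cnt}_{\langle F,i\rangle} \ge \frac{\sqrt{1+2\varepsilon}}{2}\mathsf{pivot} \ge \frac{\mathsf{thresh}}{2} > \frac{E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right]}{1+\varepsilon} \ge \frac{E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]}{1+\varepsilon}$. The second last inequality
follows from $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] < \frac{1+\varepsilon}{2}\mathsf{thresh}$. Therefore, $L_i = \emptyset$ for $i \ge m^*$ and we
have

\[
\bigcup_{i \in \{m^*,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) = \emptyset
\]

Following the observations O1, O2 and O3, we simplify Eq. 3 and obtain

\[
\Pr[L] \le \Pr\left[T_{m^*-2}\right] + \Pr\left[L_{m^*-1}\right]
\]

Employing Lemma 7 gives $\Pr[L] \le 0.126$.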


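Case 2: $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \frac{1+\varepsilon}{2}\mathsf{thresh}$


Lemma 8. Given $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \frac{1+\varepsilon}{2}\mathsf{thresh}$, the following bounds hold:


1. $\Pr\left[T_{m^*-1}\right] \le \frac{1}{10.84}$
2. $\Pr\left[L_{m^*}\right] \le \frac{1}{5.92}$


Proof. Let's first prove statement 1. From $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \frac{1+\varepsilon}{2}\mathsf{thresh}$,
we can derive $E\left[\mathsf{Cnt}_{\langle F,m^*-1\rangle}\right] \ge (1+\varepsilon)\,\mathsf{thresh}$. Therefore,
$\Pr\left[T_{m^*-1}\right] \le \Pr\left[\mathsf{Cnt}_{\langle F,m^*-1\rangle} \le \frac{1}{1+\varepsilon} E\left[\mathsf{Cnt}_{\langle F,m^*-1\rangle}\right]\right]$. Finally, employing Lemma 5 with
$\beta = \frac{1}{1+\varepsilon}$, we obtain

\[
\Pr\left[T_{m^*-1}\right] \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot E\left[\mathsf{Cnt}_{\langle F,m^*-1\rangle}\right]} \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot (1+\varepsilon)\,\mathsf{thresh}} \le \frac{1}{1+9.84(1+2\varepsilon)} \le \frac{1}{10.84}
\]

To prove statement 2, we employ Lemma 5 with $\beta = \frac{1}{1+\varepsilon}$ and
$E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \frac{1+\varepsilon}{2}\mathsf{thresh}$ to obtain

\[
\Pr\left[L_{m^*}\right] \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right]} \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot \frac{1+\varepsilon}{2}\mathsf{thresh}} \le \frac{1}{1+4.92(1+2\varepsilon)} \le \frac{1}{5.92}
\]

Then, we prove that $\Pr[L] \le 0.262$ for $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \frac{1+\varepsilon}{2}\mathsf{thresh}$.


Proof. We aim to bound $\Pr[L]$ by the following equation:

\[
\Pr[L] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)\right] \tag{3 revisited}
\]

which can be simplified by the three observations labeled O1, O2 and O3 below.

O1: $\forall i \le m^*-1$, $T_i \subseteq T_{i+1}$. Therefore,

\[
\bigcup_{i \in \{1,\ldots,m^*-1\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-1\}} T_i \subseteq T_{m^*-1}
\]

O2: For $i = m^*$, we have

\[
\overline{T_{m^*-1}} \cap T_{m^*} \cap L_{m^*} \subseteq L_{m^*}
\]

O3: $\forall i \ge m^*+1$, since we round $\mathsf{Cnt}_{\langle F,i\rangle}$ up to $\frac{\sqrt{1+2\varepsilon}}{2}\mathsf{pivot}$ and
$m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$, we have
$2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*+1} \times \frac{\sqrt{1+2\varepsilon}}{2}\mathsf{pivot} \ge \sqrt{1+2\varepsilon}\,|\mathsf{sol}(F)| \ge \frac{|\mathsf{sol}(F)|}{1+\varepsilon}$.
Then we have $\mathsf{Cnt}_{\langle F,i\rangle} \ge \frac{E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]}{1+\varepsilon}$. Therefore,
$L_i = \emptyset$ for $i \ge m^*+1$ and we have

\[
\bigcup_{i \in \{m^*+1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) = \emptyset
\]

Following the observations O1, O2 and O3, we simplify Eq. 3 and obtain

\[
\Pr[L] \le \Pr\left[T_{m^*-1}\right] + \Pr\left[L_{m^*}\right]
\]

Employing Lemma 8 gives $\Pr[L] \le 0.262$.


Combining Cases 1 and 2, we obtain $\Pr[L] \le \max\{0.126, 0.262\} = 0.262$.
Therefore, we prove the statement for ApproxMC6: $\Pr[L] \le 0.262$ for $\varepsilon < \sqrt{2}-1$.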


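D.2 Proof of $\Pr[L] \le 0.085$ for $1 \le \varepsilon < 3$


Lemma 9. Given $1 \le \varepsilon < 3$, the following bounds hold:


1. $\Pr\left[T_{m^*-4}\right] \le \frac{1}{86.41}$
2. $\Pr\left[L_{m^*-3}\right] \le \frac{1}{40.36}$
3. $\Pr\left[L_{m^*-2}\right] \le \frac{1}{20.68}$


Proof. Let's first prove statement 1. For $\varepsilon < 3$, we have
$\mathsf{thresh} < \frac{7}{4}\mathsf{pivot}$ and $E\left[\mathsf{Cnt}_{\langle F,m^*-4\rangle}\right] \ge 8\,\mathsf{pivot}$. Therefore,
$\Pr\left[T_{m^*-4}\right] \le \Pr\left[\mathsf{Cnt}_{\langle F,m^*-4\rangle} \le \frac{7}{32} E\left[\mathsf{Cnt}_{\langle F,m^*-4\rangle}\right]\right]$. Finally, employing Lemma 5 with
$\beta = \frac{7}{32}$, we obtain

\[
\Pr\left[T_{m^*-4}\right] \le \frac{1}{1+\left(1-\frac{7}{32}\right)^2 \cdot 8\,\mathsf{pivot}} \le \frac{1}{1+\left(1-\frac{7}{32}\right)^2 \cdot 8 \cdot 9.84 \cdot \left(1+\frac{1}{3}\right)^2} \le \frac{1}{86.41}
\]

To prove statement 2, we employ Lemma 5 with $\beta = \frac{1}{1+\varepsilon}$ and
$E\left[\mathsf{Cnt}_{\langle F,m^*-3\rangle}\right] \ge 4\,\mathsf{pivot}$ to obtain

\[
\Pr\left[L_{m^*-3}\right] \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot E\left[\mathsf{Cnt}_{\langle F,m^*-3\rangle}\right]} \le \frac{1}{1+\left(1-\frac{1}{1+\varepsilon}\right)^2 \cdot 4 \cdot 9.84 \cdot \left(1+\frac{1}{\varepsilon}\right)^2} = \frac{1}{40.36}
\]

Following the proof of Lemma 2 in [8], we can prove statement 3.


Now let us prove the statement for ApproxMC6: $\Pr[L] \le 0.085$ for $1 \le \varepsilon < 3$.

Proof. We aim to bound $\Pr[L]$ by the following equation:

\[
\Pr[L] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)\right] \tag{3 revisited}
\]

which can be simplified by the three observations labeled O1, O2 and O3 below.

O1: $\forall i \le m^*-4$, $T_i \subseteq T_{i+1}$. Therefore,

\[
\bigcup_{i \in \{1,\ldots,m^*-4\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-4\}} T_i \subseteq T_{m^*-4}
\]

O2: For $i \in \{m^*-3, m^*-2\}$, we have

\[
\bigcup_{i \in \{m^*-3,\, m^*-2\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq L_{m^*-3} \cup L_{m^*-2}
\]

O3: $\forall i \ge m^*-1$, since we round $\mathsf{Cnt}_{\langle F,i\rangle}$ up to $\mathsf{pivot}$ and $m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$,
we have $2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*-1} \times \mathsf{pivot} \ge \frac{|\mathsf{sol}(F)|}{2} \ge \frac{|\mathsf{sol}(F)|}{1+\varepsilon}$. The
last inequality follows from $\varepsilon \ge 1$. Then we have $\mathsf{Cnt}_{\langle F,i\rangle} \ge \frac{E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]}{1+\varepsilon}$.
Therefore, $L_i = \emptyset$ for $i \ge m^*-1$ and we have

\[
\bigcup_{i \in \{m^*-1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) = \emptyset
\]

Following the observations O1, O2 and O3, we simplify Eq. 3 and obtain

\[
\Pr[L] \le \Pr\left[T_{m^*-4}\right] + \Pr\left[L_{m^*-3}\right] + \Pr\left[L_{m^*-2}\right]
\]

Employing Lemma 9 gives $\Pr[L] \le 0.085$.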


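D.3 Proof of $\Pr[L] \le 0.055$ for $3 \le \varepsilon < 4\sqrt{2}-1$


Lemma 10. Given $3 \le \varepsilon < 4\sqrt{2}-1$, the following bound holds:

\[
\Pr\left[T_{m^*-3}\right] \le \frac{1}{18.19}
\]

Proof. For $\varepsilon < 4\sqrt{2}-1$, we have $\mathsf{thresh} < \left(2 - \frac{\sqrt{2}}{8}\right)\mathsf{pivot}$ and $E\left[\mathsf{Cnt}_{\langle F,m^*-3\rangle}\right] \ge 4\,\mathsf{pivot}$.
Therefore, $\Pr\left[T_{m^*-3}\right] \le \Pr\left[\mathsf{Cnt}_{\langle F,m^*-3\rangle} \le \left(\frac{1}{2} - \frac{\sqrt{2}}{32}\right) E\left[\mathsf{Cnt}_{\langle F,m^*-3\rangle}\right]\right]$.
Finally, employing Lemma 5 with $\beta = \frac{1}{2} - \frac{\sqrt{2}}{32}$, we obtain

\[
\Pr\left[T_{m^*-3}\right] \le \frac{1}{1+\left(1-\left(\frac{1}{2}-\frac{\sqrt{2}}{32}\right)\right)^2 \cdot 4\,\mathsf{pivot}} \le \frac{1}{1+\left(1-\left(\frac{1}{2}-\frac{\sqrt{2}}{32}\right)\right)^2 \cdot 4 \cdot 9.84 \cdot \left(1+\frac{1}{4\sqrt{2}-1}\right)^2} \le \frac{1}{18.19}
\]

Now let us prove the statement for ApproxMC6: $\Pr[L] \le 0.055$ for $3 \le \varepsilon < 4\sqrt{2}-1$.

Proof. We aim to bound $\Pr[L]$ by the following equation:

\[
\Pr[L] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)\right] \tag{3 revisited}
\]

which can be simplified by the two observations labeled O1 and O2 below.

O1: $\forall i \le m^*-3$, $T_i \subseteq T_{i+1}$. Therefore,

\[
\bigcup_{i \in \{1,\ldots,m^*-3\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-3\}} T_i \subseteq T_{m^*-3}
\]

O2: $\forall i \ge m^*-2$, since we round $\mathsf{Cnt}_{\langle F,i\rangle}$ to $\mathsf{pivot}$ and $m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$,
we have $2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*-2} \times \mathsf{pivot} \ge \frac{|\mathsf{sol}(F)|}{4} \ge \frac{|\mathsf{sol}(F)|}{1+\varepsilon}$. The
last inequality follows from $\varepsilon \ge 3$. Then we have $\mathsf{Cnt}_{\langle F,i\rangle} \ge \frac{E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]}{1+\varepsilon}$.
Therefore, $L_i = \emptyset$ for $i \ge m^*-2$ and we have

\[
\bigcup_{i \in \{m^*-2,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) = \emptyset
\]

Following the observations O1 and O2, we simplify Eq. 3 and obtain

\[
\Pr[L] \le \Pr\left[T_{m^*-3}\right]
\]

Employing Lemma 10 gives $\Pr[L] \le 0.055$.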


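D.4 Proof of $\Pr[L] \le 0.023$ for $\varepsilon \ge 4\sqrt{2}-1$


Lemma 11. Given $\varepsilon \ge 4\sqrt{2}-1$, the following bound holds:

\[
\Pr\left[T_{m^*-4}\right] \le \frac{1}{45.28}
\]

Proof. We have $\mathsf{thresh} < 2\,\mathsf{pivot}$ and $E\left[\mathsf{Cnt}_{\langle F,m^*-4\rangle}\right] \ge 8\,\mathsf{pivot}$. Therefore,
$\Pr\left[T_{m^*-4}\right] \le \Pr\left[\mathsf{Cnt}_{\langle F,m^*-4\rangle} \le \frac{1}{4} E\left[\mathsf{Cnt}_{\langle F,m^*-4\rangle}\right]\right]$. Finally, employing Lemma 5
with $\beta = \frac{1}{4}$, we obtain

\[
\Pr\left[T_{m^*-4}\right] \le \frac{1}{1+\left(1-\frac{1}{4}\right)^2 \cdot 8\,\mathsf{pivot}} \le \frac{1}{1+\left(1-\frac{1}{4}\right)^2 \cdot 8 \cdot 9.84} = \frac{1}{45.28}
\]

Now let us prove the statement for ApproxMC6: $\Pr[L] \le 0.023$ for $\varepsilon \ge 4\sqrt{2}-1$.


Proof. We aim to bound $\Pr[L]$ by the following equation:

\[
\Pr[L] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right)\right] \tag{3 revisited}
\]

which can be simplified by the two observations labeled O1 and O2 below.

O1: $\forall i \le m^*-4$, $T_i \subseteq T_{i+1}$. Therefore,

\[
\bigcup_{i \in \{1,\ldots,m^*-4\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-4\}} T_i \subseteq T_{m^*-4}
\]

O2: $\forall i \ge m^*-3$, since we round $\mathsf{Cnt}_{\langle F,i\rangle}$ to $\sqrt{2}\,\mathsf{pivot}$ and $m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$,
we have $2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*-3} \times \sqrt{2}\,\mathsf{pivot} \ge \frac{\sqrt{2}\,|\mathsf{sol}(F)|}{8} \ge \frac{|\mathsf{sol}(F)|}{1+\varepsilon}$.
The last inequality follows from $\varepsilon \ge 4\sqrt{2}-1$. Then we have
$\mathsf{Cnt}_{\langle F,i\rangle} \ge \frac{E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]}{1+\varepsilon}$. Therefore, $L_i = \emptyset$ for $i \ge m^*-3$ and we have

\[
\bigcup_{i \in \{m^*-3,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap L_i\right) = \emptyset
\]

Following the observations O1 and O2, we simplify Eq. 3 and obtain

\[
\Pr[L] \le \Pr\left[T_{m^*-4}\right]
\]

Employing Lemma 11 gives $\Pr[L] \le 0.023$.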


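D.5 Proof of $\Pr[U] \le 0.169$ for $\varepsilon < 3$


Lemma 12.

\[
\Pr\left[U'_{m^*}\right] \le \frac{1}{5.92}
\]

Proof. Employing Lemma 5 with $\gamma = \left(1+\frac{\varepsilon}{1+\varepsilon}\right)$ and $E\left[\mathsf{Cnt}_{\langle F,m^*\rangle}\right] \ge \mathsf{pivot}/2$, we
obtain

\[
\Pr\left[U'_{m^*}\right] \le \frac{1}{1+\left(\frac{\varepsilon}{1+\varepsilon}\right)^2 \mathsf{pivot}/2} = \frac{1}{1+9.84/2} \le \frac{1}{5.92}
\]

Now let us prove the statement for ApproxMC6: $\Pr[U] \le 0.169$ for $\varepsilon < 3$.


Proof. We aim to bound $\Pr[U]$ by the following equation:

\[
\Pr[U] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right)\right] \tag{4 revisited}
\]

We derive the following observations O1 and O2.

O1: $\forall i \le m^*-1$, since $m^* \le \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot}) + 1$, the event $T_i$ gives
$2^i \times \mathsf{Cnt}_{\langle F,i\rangle} < 2^{m^*-1} \times \mathsf{thresh} \le |\mathsf{sol}(F)|\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. Then we obtain
$\mathsf{Cnt}_{\langle F,i\rangle} < E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. Therefore, $T_i \cap U'_i = \emptyset$ for $i \le m^*-1$
and we have

\[
\bigcup_{i \in \{1,\ldots,m^*-1\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right) \subseteq \bigcup_{i \in \{1,\ldots,m^*-1\}} \left(\overline{T_{i-1}} \cap T_i \cap U'_i\right) = \emptyset
\]

O2: $\forall i \ge m^*$, $\overline{T_i}$ implies $\mathsf{Cnt}_{\langle F,i\rangle} \ge \mathsf{thresh}$ and then we have
$2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \ge 2^{m^*} \times \mathsf{thresh} > |\mathsf{sol}(F)|\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. The second inequality
follows from $m^* \ge \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot})$. Then we obtain
$\mathsf{Cnt}_{\langle F,i\rangle} > E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right]\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$. Therefore, $\overline{T_i} \subseteq U'_i$ for $i \ge m^*$. Since
$\forall i$, $\overline{T_i} \subseteq \overline{T_{i-1}}$, we have

\[
\begin{aligned}
\bigcup_{i \in \{m^*,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right)
&\subseteq \bigcup_{i \in \{m^*+1,\ldots,n\}} \overline{T_{i-1}} \cup \left(\overline{T_{m^*-1}} \cap T_{m^*} \cap U_{m^*}\right) \\
&\subseteq \overline{T_{m^*}} \cup \left(\overline{T_{m^*-1}} \cap T_{m^*} \cap U_{m^*}\right) \\
&\subseteq \overline{T_{m^*}} \cup U_{m^*} \\
&\subseteq U'_{m^*}
\end{aligned}
\tag{8}
\]

Remark that for $\varepsilon < \sqrt{2}-1$, we round $\mathsf{Cnt}_{\langle F,m^*\rangle}$ up to $\frac{\sqrt{1+2\varepsilon}}{2}\mathsf{pivot}$ and we
have $2^{m^*} \times \frac{\sqrt{1+2\varepsilon}}{2}\mathsf{pivot} \le |\mathsf{sol}(F)|(1+\varepsilon)$. For $\sqrt{2}-1 \le \varepsilon < 1$, we round
$\mathsf{Cnt}_{\langle F,m^*\rangle}$ up to $\frac{\mathsf{pivot}}{\sqrt{2}}$ and we have $2^{m^*} \times \frac{\mathsf{pivot}}{\sqrt{2}} \le |\mathsf{sol}(F)|(1+\varepsilon)$. For $1 \le \varepsilon < 3$,
we round $\mathsf{Cnt}_{\langle F,m^*\rangle}$ up to $\mathsf{pivot}$ and we have $2^{m^*} \times \mathsf{pivot} \le |\mathsf{sol}(F)|(1+\varepsilon)$.
The analysis means rounding doesn't affect the event $U_{m^*}$ and therefore
Inequality 8 still holds.


Following the observations O1 and O2, we simplify Eq. 4 and obtain

\[
\Pr[U] \le \Pr\left[U'_{m^*}\right]
\]

Employing Lemma 12 gives $\Pr[U] \le 0.169$.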


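D.6 Proof of $\Pr[U] \le 0.044$ for $\varepsilon \ge 3$


Lemma 13.

\[
\Pr\left[\overline{T_{m^*+1}}\right] \le \frac{1}{23.14}
\]

Proof. Since $E\left[\mathsf{Cnt}_{\langle F,m^*+1\rangle}\right] \le \frac{\mathsf{pivot}}{2}$, we have
$\Pr\left[\overline{T_{m^*+1}}\right] \le \Pr\left[\mathsf{Cnt}_{\langle F,m^*+1\rangle} > 2\left(1+\frac{\varepsilon}{1+\varepsilon}\right) E\left[\mathsf{Cnt}_{\langle F,m^*+1\rangle}\right]\right]$. Employing Lemma 5 with
$\gamma = 2\left(1+\frac{\varepsilon}{1+\varepsilon}\right)$ and $E\left[\mathsf{Cnt}_{\langle F,m^*+1\rangle}\right] \ge \frac{\mathsf{pivot}}{4}$, we obtain

\[
\Pr\left[\overline{T_{m^*+1}}\right] \le \frac{1}{1+\left(1+\frac{2\varepsilon}{1+\varepsilon}\right)^2 \mathsf{pivot}/4} = \frac{1}{1+2.46 \cdot \left(3+\frac{1}{\varepsilon}\right)^2} \le \frac{1}{1+2.46 \cdot 3^2} \le \frac{1}{23.14}
\]

Now let us prove the statement for ApproxMC6: $\Pr[U] \le 0.044$ for $\varepsilon \ge 3$.

Proof. We aim to bound $\Pr[U]$ by the following equation:

\[
\Pr[U] = \Pr\left[\bigcup_{i \in \{1,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right)\right] \tag{4 revisited}
\]

We derive the following observations O1 and O2.

O1: $\forall i \le m^*+1$, for $3 \le \varepsilon < 4\sqrt{2}-1$, because we round $\mathsf{Cnt}_{\langle F,i\rangle}$ to $\mathsf{pivot}$ and have
$m^* \le \log_2 |\mathsf{sol}(F)| - \log_2(\mathsf{pivot}) + 1$, we obtain $2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \le 2^{m^*+1} \times \mathsf{pivot} \le
4 \cdot |\mathsf{sol}(F)| \le (1+\varepsilon)|\mathsf{sol}(F)|$. For $\varepsilon \ge 4\sqrt{2}-1$, we round $\mathsf{Cnt}_{\langle F,i\rangle}$ to $\sqrt{2}\,\mathsf{pivot}$ and
obtain $2^i \times \mathsf{Cnt}_{\langle F,i\rangle} \le 2^{m^*+1} \times \sqrt{2}\,\mathsf{pivot} \le 4\sqrt{2} \cdot |\mathsf{sol}(F)| \le (1+\varepsilon)|\mathsf{sol}(F)|$. Then,
we obtain $\mathsf{Cnt}_{\langle F,i\rangle} \le E\left[\mathsf{Cnt}_{\langle F,i\rangle}\right](1+\varepsilon)$. Therefore, $U_i = \emptyset$ for $i \le m^*+1$
and we have

\[
\bigcup_{i \in \{1,\ldots,m^*+1\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right) = \emptyset
\]

O2: $\forall i \ge m^*+2$, since $\forall i$, $\overline{T_i} \subseteq \overline{T_{i-1}}$, we have

\[
\bigcup_{i \in \{m^*+2,\ldots,n\}} \left(\overline{T_{i-1}} \cap T_i \cap U_i\right) \subseteq \bigcup_{i \in \{m^*+2,\ldots,n\}} \overline{T_{i-1}} \subseteq \overline{T_{m^*+1}}
\]

Following the observations O1 and O2, we simplify Eq. 4 and obtain

\[
\Pr[U] \le \Pr\left[\overline{T_{m^*+1}}\right]
\]

Employing Lemma 13 gives $\Pr[U] \le 0.044$.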
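The numeric constants in Appendix D can be rechecked mechanically. The Python sketch below recomputes the denominators of four of the lemma bounds from Lemma 5 and verifies that the per-event bounds used in each case sum to at most the corresponding constant of Lemma 4. It assumes $\mathsf{pivot} = 9.84\,(1+1/\varepsilon)^2$, the value substituted in the derivations above; the helper names and dictionary labels are illustrative.

```python
import math

def pivot(eps: float) -> float:
    # Assumed definition, read off the substitutions in Lemmas 7-13.
    return 9.84 * (1 + 1 / eps) ** 2

def inverse_lower_tail(beta: float, mean: float) -> float:
    """Reciprocal of the Lemma 5 lower-tail bound: 1 + (1 - beta)^2 * mean."""
    return 1 + (1 - beta) ** 2 * mean

r2 = math.sqrt(2)
recomputed = {
    "Lemma 7, statement 1": inverse_lower_tail(1 - r2 / 4, 2 * pivot(r2 - 1)),
    "Lemma 9, statement 1": inverse_lower_tail(7 / 32, 8 * pivot(3)),
    "Lemma 10": inverse_lower_tail(1 / 2 - r2 / 32, 4 * pivot(4 * r2 - 1)),
    "Lemma 11": inverse_lower_tail(1 / 4, 8 * 9.84),
}

# (sum of per-event bounds, constant claimed in Lemma 4) for each case.
case_totals = {
    "Pr[L], eps < sqrt(2)-1, case 1": (1 / 29.67 + 1 / 10.84, 0.126),
    "Pr[L], eps < sqrt(2)-1, case 2": (1 / 10.84 + 1 / 5.92, 0.262),
    "Pr[L], 1 <= eps < 3": (1 / 86.41 + 1 / 40.36 + 1 / 20.68, 0.085),
    "Pr[L], 3 <= eps < 4*sqrt(2)-1": (1 / 18.19, 0.055),
    "Pr[L], eps >= 4*sqrt(2)-1": (1 / 45.28, 0.023),
    "Pr[U], eps < 3": (1 / 5.92, 0.169),
    "Pr[U], eps >= 3": (1 / 23.14, 0.044),
}
all_cases_hold = all(total <= bound for total, bound in case_totals.values())

for name, value in recomputed.items():
    print(f"{name}: {value:.4f}")
print(all_cases_hold)
```

Each lemma evaluates pivot at the largest $\varepsilon$ of its range, since pivot decreases in $\varepsilon$ and the bound is therefore weakest there.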
Abstract. We study satisfiability modulo the theory of finite fields and
give a decision procedure for this theory. We implement our procedure 
for prime fields inside the cvc5 SMT solver. Using this theory, we con- 
struct SMT queries that encode translation validation for various zero 
knowledge proof compilers applied to Boolean computations. We evalu- 
ate our procedure on these benchmarks. Our experiments show that our 
implementation is superior to previous approaches (which encode field 
arithmetic using integers or bit-vectors). 


1 Introduction 


Finite fields are critical to the design of recent cryptosystems. For instance, 
elliptic curve operations are defined in terms of operations in a finite field. Also, 
Zero-Knowledge Proofs (ZKPs) and Multi-Party Computations (MPCs), pow- 
erful tools for building secure and private systems, often require key properties 
of the system to be expressed as operations in a finite field. 

Field-based cryptosystems already safeguard everything from our money 
to our privacy. Over 80% of our TLS connections, for example, use elliptic 
curves [4,66]. Private cryptocurrencies [32,59,89] built on ZKPs have billion- 
dollar market capitalizations [44,45]. And MPC protocols have been used to 
operate auctions [17], facilitate sensitive cross-agency collaboration in the US 
federal government [5], and compute cross-company pay gaps [8]. These systems 
safeguard our privacy, assets, and government data. Their importance justifies 
spending considerable effort to ensure that the systems are free of bugs that 
could compromise the resources they are trying to protect; thus, they are prime 
targets for formal verification. 

However, verifying field-based cryptosystems is challenging, in part because 
current automated verification tools do not reason directly about finite fields. 
Many tools use Satisfiability Modulo Theories (SMT) solvers as a back-end [9,
27,33,93,95]. SMT solvers [7,10,12,20,26,35,73,76,77] are automated reasoners
that determine the satisfiability of formulas in first-order logic with respect to one 
or more background theories. They combine propositional search with specialized 
reasoning procedures for these theories, which model common data types such 
as Booleans, integers, reals, bit-vectors, arrays, algebraic datatypes, and more. 
© The Author(s) 2023 
Since SMT solvers do not currently support a theory of finite fields, SMT-based
tools must encode field operations using another theory. 

There are two natural ways to represent finite fields using commonly sup- 
ported theories in SMT, but both are ultimately inefficient. Recall that a finite 
field of prime order can be represented as the integers with addition and multi- 
plication performed modulo a prime p. Thus, field operations can be represented 
using integers or bit-vectors: both support addition, multiplication, and mod- 
ular reduction. However, both approaches fall short. Non-linear integer reason- 
ing is notoriously challenging for SMT solvers, and bit-vector solvers perform 
abysmally on fields of cryptographic size (hundreds of bits). 
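To make the integer-based representation concrete, here is a minimal Python sketch of prime-field arithmetic as integers reduced modulo $p$. The prime $2^{255}-19$ is chosen only as a familiar example of cryptographic size, and the function names are hypothetical; this illustrates the representation being encoded and is not an SMT encoding itself.

```python
# Illustrative prime of cryptographic size (the Curve25519 field prime).
P = 2**255 - 19

def ff_add(a: int, b: int, p: int = P) -> int:
    """Field addition: integer addition followed by reduction modulo p."""
    return (a + b) % p

def ff_mul(a: int, b: int, p: int = P) -> int:
    """Field multiplication: integer multiplication followed by reduction."""
    return (a * b) % p

def ff_inv(a: int, p: int = P) -> int:
    """Multiplicative inverse of a nonzero element (requires Python 3.8+)."""
    return pow(a, -1, p)

x = 123456789
print(ff_mul(x, ff_inv(x)))  # a nonzero element times its inverse
print(ff_add(P - 1, 1))      # wrap-around at the modulus
```

Every multiplication of two variables followed by a reduction is a non-linear constraint from the solver's point of view, which is the source of the difficulty noted above.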

In this paper, we develop for the first time a direct solver for finite fields 
within an SMT solver. We use well-known ideas from computer algebra (specifi- 
cally, Gröbner bases [21] and triangular decomposition [6,99]) to form the basis 
of our decision procedure. However, we improve on this baseline in two impor- 
tant ways. First, our decision procedure does not manipulate field polynomials 
(i.e., those of form $X^p - X$). As expected, this results in a loss of completeness
at the Gröbner basis stage. However, surprisingly, this often does not matter.
Furthermore, completeness is recovered during the model construction algorithm 
(albeit in a rather rudimentary way). This modification turns out to be crucial for 
obtaining reasonable performance. Second, we implement a proof-tracing
mechanism in the Gröbner basis engine, thereby enabling it to compute unsatisfiable
cores, which is also very beneficial in the context of SMT solving. Finally, we 
implement all of this as a theory solver for prime-order fields inside the cvc5 
SMT solver. 
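The role of field polynomials can be seen on a toy example. The Python sketch below checks that $X^p - X$ vanishes on every element of a small prime field, and that $x^2 + 1 = 0$ has no solution in the field with 3 elements even though it has roots in the extension field with 9 elements. The brute-force helpers are purely illustrative and are unrelated to the cvc5 implementation.

```python
def field_polynomial(x: int, p: int) -> int:
    """Evaluate X^p - X at x in the prime field with p elements."""
    return (pow(x, p, p) - x) % p

p = 13
vanishes_everywhere = all(field_polynomial(x, p) == 0 for x in range(p))

# x^2 + 1 has no root among the three elements of F_3.
roots_in_f3 = [x for x in range(3) if (x * x + 1) % 3 == 0]

print(vanishes_everywhere, roots_in_f3)
```

Adding $X^p - X$ for each variable restricts solutions to the base field, whereas purely ideal-theoretic reasoning without it decides solvability over the algebraic closure; this is the completeness gap discussed above.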

To guide research in this area, we also give a first set of QF_FF (quantifier-free, 
finite field) benchmarks, obtained from the domain of ZKP compiler correctness. 
ZKP compilers translate from high-level computations (e.g., over Booleans, bit- 
vectors, arrays, etc.) to systems of finite field constraints that are usable by ZKPs. 
We instrument existing ZKP compilers to produce translation validation [86] ver- 
ification conditions, i.e. conditions that represent desirable correctness properties 
of a specific compilation. We give these compilers concrete Boolean computa- 
tions (which we sample at random), and construct SMT formulas capturing the 
correctness of the ZKP compilers’ translations of those computations into field 
constraints. We represent the formulas using both our new theory of finite fields 
and also the alternative theory encodings mentioned above. 

We evaluate our tool on these benchmarks and compare it to the approaches 
based on bit-vectors, integers, and pure computer algebra (without SMT). We 
find that our tool significantly outperforms the other solutions. Compared to the 
best previous solution (we list prior alternatives in Sect. 7), it is 6x faster and
it solves 2x more benchmarks. 

In sum, our contributions are: 


1. a definition of the theory of finite fields in the context of SMT; 

2. a decision procedure for this theory that avoids field polynomials and produces
unsatisfiable cores;

3. the first public theory solver for this theory (implemented in cvc5); and

4. the first set of QF_FF benchmarks, which encode translation validation queries
for ZKP compilers on Boolean computations.


In the rest of the paper, we discuss related work (§1.1), cover background 
and notation (§2), define the theory of finite fields (§3), give a decision procedure 
(§4), describe our implementation (§5), explain the benchmarks (§6), and report 
on experiments (§7). 


1.1 Related Work 


There is a large body of work on computer algebra, with many algorithms implemented
in various tools [1,18,31,37,49,52,58,72,100,101]. However, the focus
of that work is on quickly constructing useful algebraic objects (e.g., a Gröbner
basis), rather than on searching for a solution to a set of field constraints.

One line of recent work [54,55] by Hader and Kovács considers SMT-oriented
field reasoning. Unlike our approach, theirs scales poorly with field
size because it uses field polynomials to achieve completeness. Furthermore, their
solver is not public.

Others consider verifying field constraints used in ZKPs. One paper surveys
possible approaches [97], and another considers proof-producing ZKP compilation [24].
However, neither develops automated, general-purpose tools.

Still other works study automated reasoning for non-linear arithmetic over
reals and integers [3,23,25,29,47,60-62,70,74,96,98]. A key challenge is reasoning
about comparisons. We work over finite fields and do not consider comparisons
because they are used for neither elliptic curves nor most ZKPs.

Further afield, researchers have developed techniques for verified algebraic
reasoning in proof assistants [15,64,75,79], with applications to mathematics
[19,28,51,65] and cryptography [39,40,85,91]. In contrast, our focus is on
fully automated reasoning about finite fields.


2 Background 


2.1 Algebra 


Here, we summarize algebraic definitions and facts that we will use; see [71, 
Chapters 1 through 8] or [34, Part IV] for a full presentation. 


Finite Fields. A finite field is a finite set equipped with binary operations $+$
and $\times$ that have identities (0 and 1 respectively), have inverses (save that there
is no multiplicative inverse for 0), and satisfy associativity, commutativity, and
distributivity. The order of a finite field is the size of the set. All finite fields have
order $q = p^e$ for some prime $p$ (called the characteristic) and positive integer $e$.
Such an integer $q$ is called a prime power.

Up to isomorphism, the field of order $q$ is unique and is denoted $\mathbb{F}_q$, or $\mathbb{F}$ when
the order is clear from context. The fields $\mathbb{F}_{q^d}$ for $d > 1$ are called extension fields
of $\mathbb{F}_q$. In contrast, $\mathbb{F}_q$ may be called the base field. We write $\mathbb{F} \subset \mathbb{G}$ to indicate
that $\mathbb{F}$ is a field that is isomorphic to the result of restricting field $\mathbb{G}$ to some
subset of its elements (but with the same operations). We note in particular that
$\mathbb{F}_q \subset \mathbb{F}_{q^d}$. A field of prime order $p$ is called a prime field.


Polynomials. For a finite field $\mathbb{F}$ and formal variables $X_1, \dots, X_k$, $\mathbb{F}[X_1, \dots, X_k]$
denotes the set of polynomials in $X_1, \dots, X_k$ with coefficients in $\mathbb{F}$. By taking
the variables to be in $\mathbb{F}$, a polynomial $f \in \mathbb{F}[X_1, \dots, X_k]$ can be viewed as a
function from $\mathbb{F}^k \to \mathbb{F}$. However, by taking the variables to be in an extension
$\mathbb{G}$ of $\mathbb{F}$, $f$ can also be viewed as a function from $\mathbb{G}^k \to \mathbb{G}$.

For a set of polynomials $S = \{f_1, \dots, f_m\} \subset \mathbb{F}_q[X_1, \dots, X_k]$, the set
$I = \{g_1 f_1 + \cdots + g_m f_m : g_i \in \mathbb{F}_q[X_1, \dots, X_k]\}$ is called the ideal generated by $S$ and
is denoted $\langle f_1, \dots, f_m \rangle$ or $\langle S \rangle$. In turn, $S$ is called a basis for the ideal $I$.

The variety of an ideal $I$ in field $\mathbb{G} \supset \mathbb{F}$ is denoted $\mathcal{V}_{\mathbb{G}}(I)$, and is the set
$\{x \in \mathbb{G}^k : \forall f \in I, f(x) = 0\}$. That is, $\mathcal{V}_{\mathbb{G}}(I)$ contains the common zeros of
polynomials in $I$, viewed as functions over $\mathbb{G}$. Note that for any set of polynomials
$S$ that generates $I$, $\mathcal{V}_{\mathbb{G}}(I)$ contains exactly the common zeros of $S$ in $\mathbb{G}$. When
the space $\mathbb{G}$ is just $\mathbb{F}$, we denote the variety as $\mathcal{V}(I)$. An ideal $I$ that contains 1
contains all polynomials and is called trivial.

One can show that if $I$ is trivial, then $\mathcal{V}(I) = \emptyset$. However, the converse does
not hold. For instance, $X^2 + 1 \in \mathbb{F}_3[X]$ has no zeros in $\mathbb{F}_3$, but $1 \notin \langle X^2 + 1 \rangle$.
But, one can also show that $I$ is trivial iff for all extensions $\mathbb{G}$ of $\mathbb{F}$, $\mathcal{V}_{\mathbb{G}}(I) = \emptyset$.

The field polynomial for field $\mathbb{F}_q$ in variable $X$ is $X^q - X$. Its zeros are all of
$\mathbb{F}_q$ and it has no additional zeros in any extension of $\mathbb{F}_q$. Thus, for an ideal $I$ of
polynomials in $\mathbb{F}[X_1, \dots, X_k]$ that contains field polynomials for each variable
$X_i$, $I$ is trivial iff $\mathcal{V}(I) = \emptyset$. For this reason, field polynomials are a common tool
for ensuring the completeness of ideal-based reasoning techniques [48,54,97].
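
The two facts above can be sanity-checked by brute force on small prime fields. The following is only an illustrative sketch: the helper names `zeros_in_prime_field` and `field_polynomial`, and the choice of the fields of order 7 and 3, are assumptions made for the example and are not part of the procedure in this paper.

```python
def zeros_in_prime_field(coeffs, p):
    """Return all x in F_p with sum(coeffs[i] * x**i) == 0 (mod p)."""
    return [x for x in range(p)
            if sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p == 0]


def field_polynomial(p):
    """Coefficient list (lowest degree first) of X^p - X."""
    coeffs = [0] * (p + 1)
    coeffs[1] = -1
    coeffs[p] = 1
    return coeffs


# X^7 - X vanishes on every element of F_7.
print(zeros_in_prime_field(field_polynomial(7), 7))
# X^2 + 1 has no zeros in F_3, even though the ideal it generates is not trivial.
print(zeros_in_prime_field([1, 0, 1], 3))
```

Brute force is only feasible because the fields are tiny; the degree of the field polynomial grows with the field order, which is the cost discussed later in the paper.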


Representation. We represent $\mathbb{F}_p$ as the set of integers $\{0, 1, \dots, p - 1\}$, with
the operations $+$ and $\times$ performed modulo $p$. The representation of $\mathbb{F}_{p^e}$ with
$e > 1$ is more complex. Unfortunately, the set $\{0, 1, \dots, p^e - 1\}$ with $+$ and $\times$
performed modulo $p^e$ is not a field because multiples of $p$ do not have multiplicative
inverses. Instead, we represent $\mathbb{F}_{p^e}$ as the set of polynomials in $\mathbb{F}_p[X]$
of degree less than $e$. The operations $+$ and $\times$ are performed modulo $q(X)$, an
irreducible polynomial¹ of degree $e$ [71, Chapter 6]. There are $p^e$ such polynomials,
and so long as $q(X)$ is irreducible, all (save 0) have inverses. Note that this
definition of $\mathbb{F}_{p^e}$ generalizes $\mathbb{F}_p$, and captures the fact that $\mathbb{F}_p \subset \mathbb{F}_{p^e}$.
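
To make the contrast concrete, the sketch below compares the integers modulo 4 with the field of order 4 built as polynomials over the field of order 2. The modulus $X^2 + X + 1$ (irreducible over the field of order 2) and all helper names are assumptions chosen for illustration.

```python
from itertools import product


def has_inverse_mod(n, a):
    """Does a have a multiplicative inverse among the integers modulo n?"""
    return any((a * b) % n == 1 for b in range(n))


def f4_mul(a, b):
    """Multiply a0 + a1*X and b0 + b1*X modulo X^2 + X + 1 over F_2.

    Elements are pairs (constant coefficient, X coefficient); X^2 is
    rewritten as X + 1.
    """
    a0, a1 = a
    b0, b1 = b
    return ((a0 * b0 + a1 * b1) % 2, (a0 * b1 + a1 * b0 + a1 * b1) % 2)


F4 = list(product(range(2), repeat=2))  # the four polynomials of degree < 2
ONE = (1, 0)


def f4_has_inverse(a):
    return any(f4_mul(a, b) == ONE for b in F4)


print(has_inverse_mod(4, 2))                              # 2 is a multiple of p = 2
print([f4_has_inverse(a) for a in F4 if a != (0, 0)])     # every non-zero element
```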




2.2 Ideal Membership 


The ideal membership problem is to determine whether a given polynomial p is 
in the ideal generated by a given set of polynomials D. We summarize definitions 
and facts relevant to algorithms for this problem; see [30] for a full presentation. 


Monomial Ordering. In $\mathbb{F}[X_1, \dots, X_k]$, a monomial is a polynomial of form
$X_1^{e_1} \cdots X_k^{e_k}$ with non-negative integers $e_i$. A monomial ordering is a total ordering
on monomials such that for all monomials $p, q, r$, if $p < q$, then $pr < qr$.
The lexicographical ordering for monomials $X_1^{e_1} \cdots X_k^{e_k}$ orders them lexicographically
by the tuple $(e_1, \dots, e_k)$. The graded-reverse lexicographical (grevlex)
ordering is lexicographical by the tuple $(e_1 + \cdots + e_k, e_1, \dots, e_k)$. With respect
to an ordering, $\mathrm{lm}(f)$ denotes the greatest monomial of a polynomial $f$.


¹ Recall that an irreducible polynomial cannot be factored into two or more non-constant
polynomials.
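
As a small illustration, both orderings can be expressed as sort keys on exponent tuples, exactly following the tuples given in the text. The function names and the sample monomials are assumptions made for this sketch.

```python
def lex_key(exps):
    """Key for the lexicographical ordering: the exponent tuple itself."""
    return tuple(exps)


def graded_key(exps):
    """Key for the graded ordering as written in the text: total degree first."""
    return (sum(exps), *exps)


def leading_monomial(monomials, key):
    """The greatest monomial with respect to the ordering given by `key`."""
    return max(monomials, key=key)


# Exponent tuples for X1^2, X2^3, and X1*X2 in F[X1, X2].
monomials = [(2, 0), (0, 3), (1, 1)]
print(leading_monomial(monomials, lex_key))     # X1^2 wins lexicographically
print(leading_monomial(monomials, graded_key))  # X2^3 wins on total degree
```

The example shows that the leading monomial of the same polynomial can differ between orderings, which is why the choice of ordering affects reduction and Gröbner basis computation.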


Reduction. For polynomials $p$ and $d$, if $\mathrm{lm}(d)$ divides a term $t$ of $p$, then we say
that $p$ reduces to $r$ modulo $d$ (written $p \to_d r$) for $r = p - \frac{t}{\mathrm{lm}(d)} d$. For a set of
polynomials $D$, we write $p \to_D r$ if $p \to_d r$ for some $d \in D$. Let $\to_D^*$ be the
transitive closure of $\to_D$. We define $p \Rightarrow_D r$ to hold when $p \to_D^* r$ and there is
no $r'$ such that $r \to_D r'$.

Reduction is a sound (but incomplete) algorithm for ideal membership.
That is, one can show that $p \Rightarrow_D 0$ implies $p \in \langle D \rangle$, but the converse does
not hold in general.
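
The incompleteness of reduction can be seen on a tiny example. The sketch below assumes polynomials over the field of order 7 in two variables, stored as dictionaries from exponent tuples to coefficients; the representation, the modulus, and all helper names are illustrative assumptions. It also divides by the leading coefficient so that non-monic divisors are handled.

```python
P = 7  # assumed small prime modulus


def order_key(m):
    return (sum(m), *m)  # graded ordering key, as in the text


def lm(f):
    return max(f, key=order_key)  # leading monomial of a non-zero polynomial


def add_scaled(f, g, coeff, shift):
    """Return f + coeff * X^shift * g over F_P, dropping zero terms."""
    out = dict(f)
    for m, c in g.items():
        mm = tuple(a + b for a, b in zip(m, shift))
        v = (out.get(mm, 0) + coeff * c) % P
        if v:
            out[mm] = v
        else:
            out.pop(mm, None)
    return out


def reduce_step(p, d):
    """One step p ->_d r, or None if lm(d) divides no term of p."""
    ld = lm(d)
    inv = pow(d[ld], -1, P)
    for m, c in p.items():
        if all(a >= b for a, b in zip(m, ld)):
            shift = tuple(a - b for a, b in zip(m, ld))
            return add_scaled(p, d, (-c * inv) % P, shift)
    return None


def normal_form(p, D):
    """Reduce p by D until no step applies."""
    while p:
        for d in D:
            r = reduce_step(p, d)
            if r is not None:
                p = r
                break
        else:
            break
    return p


f1 = {(1, 1): 1, (0, 0): 1}          # X*Y + 1
f2 = {(0, 2): 1, (0, 0): 1}          # Y^2 + 1
# p = Y*f1 - X*f2 = Y - X lies in the ideal generated by f1 and f2 ...
p = add_scaled(add_scaled({}, f1, 1, (0, 1)), f2, P - 1, (1, 0))
# ... yet no leading monomial divides any of its terms, so it does not reduce to 0.
print(p, normal_form(p, [f1, f2]))
```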


Gröbner Bases. Define the s-polynomial for polynomials $p$ and $q$ by $\mathrm{spoly}(p, q) =
p \cdot \mathrm{lm}(q) - q \cdot \mathrm{lm}(p)$. A Gröbner basis (GB) [21] is a set of polynomials $P$ characterized
by the following equivalent conditions:


1. $\forall p, p' \in P$, $\mathrm{spoly}(p, p') \Rightarrow_P 0$ (closure under the reduction of s-polynomials)
2. $\forall p \in \langle P \rangle$, $p \Rightarrow_P 0$ (reduction is a complete test for ideal membership)


Gröbner bases are useful for deciding ideal membership. From the first characterization,
one can build algorithms for constructing a Gröbner basis for any
ideal [21]. Then, the second characterization gives an ideal membership test.
When $P$ is a GB, the relation $\Rightarrow_P$ is a function (i.e., $\to_P$ is confluent), and it
can be efficiently computed [1,21]; thus, this test is efficient.

A Gröbner basis engine takes a set of generators $G$ for some ideal $I$ and
computes a Gröbner basis for $I$. We describe the high-level design of such engines
here. An engine constructs a sequence of bases $G_0, G_1, G_2, \dots$ (with $G_0 = G$)
until some $G_i$ is a Gröbner basis. Each $G_i$ is constructed from $G_{i-1}$ according to
one of three types of steps. First, for some $p, q \in G_{i-1}$ such that $\mathrm{spoly}(p, q) \Rightarrow_{G_{i-1}} r \ne 0$,
the engine can set $G_i = G_{i-1} \cup \{r\}$. Second, for some $p \in G_{i-1}$ such that
$p \Rightarrow_{G_{i-1} \setminus \{p\}} r \ne p$, the engine can set $G_i = (G_{i-1} \setminus \{p\}) \cup \{r\}$. Third, for some
$p \in G_{i-1}$ such that $p \Rightarrow_{G_{i-1} \setminus \{p\}} 0$, the engine can set $G_i = G_{i-1} \setminus \{p\}$. Notice
that all rules depend on the current basis; some add polynomials, and some
remove them. In general, it is unclear which sequence of steps will construct a
Gröbner basis most quickly: this is an active area of research [1,18,41,43].


2.3 Zero Knowledge Proofs 


Zero-knowledge proofs allow one to prove that some secret data satisfies a public
property, without revealing the data itself. See [94] for a full presentation; we
give a brief overview here. There are two parties: a verifier V and a prover P. V
knows a public instance $x$ and asks P to show that it has knowledge of a secret
witness $w$ satisfying a public predicate $\phi(x, w)$. To do so, P runs an efficient
(i.e., polytime in a security parameter $\lambda$) proving algorithm $\mathsf{Prove}(\phi, x, w) \to \pi$
and sends the resulting proof $\pi$ to V. Then, V runs an efficient verification
algorithm $\mathsf{Verify}(\phi, x, \pi) \to \{0, 1\}$ that accepts or rejects the proof. A system for
Zero-Knowledge Proofs of knowledge (ZKPs) is a (Prove, Verify) pair with:


– completeness: If $\phi(x, w)$, then $\Pr[\mathsf{Verify}(\phi, x, \mathsf{Prove}(\phi, x, w)) = 0] \le \mathrm{negl}(\lambda)$,²

– computational knowledge soundness [16]: (informal) a polytime adversary that
does not know $w$ satisfying $\phi$ can produce an acceptable $\pi$ with probability
at most $\mathrm{negl}(\lambda)$.

– zero-knowledge [50]: (informal) $\pi$ reveals nothing about $w$, other than its
existence.


ZKP applications are manifold. ZKPs are the basis of private cryptocurren- 
cies such as Zcash and Monero, which have a combined market capitalization 
of $2.80B as of 30 June 2022 [44,45]. They’ve also been proposed for auditing 
sealed court orders [46], operating private gun registries [63], designing privacy- 
preserving middleboxes [53] and more [22,56]. 

This breadth of applications is possible because implemented ZKPs are very
general: they support any $\phi$ checkable in polytime. However, $\phi$ must be first
compiled to a cryptosystem-compatible computation language. The most common
language is a rank-1 constraint system (R1CS). In an R1CS $C$, $x$ and $w$ are
together encoded as a vector $z \in \mathbb{F}^m$. The system $C$ is defined by three matrices
$A, B, C \in \mathbb{F}^{n \times m}$; it is satisfied when $Az \circ Bz = Cz$, where $\circ$ is the element-wise
product. Thus, the predicate can be viewed as $n$ distinct constraints, where
constraint $i$ has form $(\sum_j A_{ij} z_j)(\sum_j B_{ij} z_j) - (\sum_j C_{ij} z_j) = 0$. Note that each
constraint is a degree $\le 2$ polynomial in $m$ variables that $z$ must be a zero of.
For security reasons, $\mathbb{F}$ must be large: its prime must have ~255 bits.
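
The satisfaction condition can be written down directly. The following sketch checks $Az \circ Bz = Cz$ row by row over a small prime field; the function name, the tiny modulus 7, and the single-constraint example (encoding $x \cdot x = y$ with $z = (1, x, y)$) are assumptions made for illustration, whereas real systems use a prime of roughly 255 bits.

```python
def r1cs_satisfied(A, B, C, z, p):
    """Check that (A z) o (B z) = C z over F_p, one constraint per matrix row."""
    def dot(row):
        return sum(a * x for a, x in zip(row, z)) % p

    return all((dot(a) * dot(b) - dot(c)) % p == 0
               for a, b, c in zip(A, B, C))


# One constraint over z = (1, x, y): x * x = y.
A = [[0, 1, 0]]
B = [[0, 1, 0]]
C = [[0, 0, 1]]

print(r1cs_satisfied(A, B, C, [1, 3, 2], 7))  # 3 * 3 = 9 = 2 (mod 7)
print(r1cs_satisfied(A, B, C, [1, 3, 3], 7))  # 9 != 3 (mod 7)
```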


Encoding. The efficiency of the ZKP scales quasi-linearly with $n$. Thus, it’s
useful to encode $\phi$ as an R1CS with a minimal number of constraints. Since
equisatisfiability, not logical equivalence, is needed, encodings may introduce
new variables.

As an example, consider the Boolean computation $a \leftarrow c_1 \lor \cdots \lor c_k$. Assume
that $c'_1, \dots, c'_k \in \mathbb{F}$ are elements in $z$ that are 0 or 1 such that $c_i \leftrightarrow (c'_i = 1)$.
How can one ensure that $a' \in \mathbb{F}$ (also in $z$) is 0 or 1 and $a \leftrightarrow (a' = 1)$?
Given that there are $k - 1$ ORs, natural approaches use $O(k)$ constraints. One
clever approach is to introduce variable $x'$ and enforce constraints $x'(\sum_i c'_i) = a'$
and $(1 - a')(\sum_i c'_i) = 0$. If any $c_i$ is true, $a'$ must be 1 to satisfy the second
constraint; setting $x'$ to the sum’s inverse satisfies the first. If all $c_i$ are false, the
first constraint ensures $a'$ is 0. This encoding is correct when the sum does not
overflow; thus, $k$ must be smaller than $\mathbb{F}$’s characteristic.
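
The argument above, including the overflow caveat, can be checked exhaustively on small prime fields. This is a brute-force sketch; the function names and the small field sizes are assumptions chosen for the example.

```python
from itertools import product


def or_constraints_hold(cs, a, x, p):
    """Both constraints of the encoding: x*(sum c) = a and (1 - a)*(sum c) = 0."""
    s = sum(cs) % p
    return (x * s - a) % p == 0 and ((1 - a) * s) % p == 0


def forced_outputs(cs, p):
    """All values of a' for which some x' satisfies both constraints."""
    return {a for a in range(p) for x in range(p)
            if or_constraints_hold(cs, a, x, p)}


def encoding_is_correct(k, p):
    """Is a' forced to the OR of the inputs for every 0/1 input vector?"""
    return all(forced_outputs(cs, p) == {int(any(cs))}
               for cs in product((0, 1), repeat=k))


print(encoding_is_correct(3, 7))  # k = 3 is below the characteristic 7
print(encoding_is_correct(3, 3))  # the sum 1 + 1 + 1 overflows to 0 in F_3
```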

Optimizations like this can be quite complex. Thus, ZKP programmers use
constraint synthesis libraries [14,69] or compilers [13,24,80,81,84,92,102] to generate
an R1CS from a high-level description. Such tools support objects like
Booleans, fixed-width integers, arrays, and user-defined data-types. The correctness
of these tools is critical to the correctness of any system built with them.


² $f(\lambda) \le \mathrm{negl}(\lambda)$ if for all $c \in \mathbb{N}$, $f(\lambda) = o(\lambda^{-c})$.


2.4 SMT


We assume usual terminology for many-sorted first order logic with equality
([88] gives a complete presentation). Let $\Sigma$ be a many-sorted signature including
a sort Bool and symbol family $\approx_\sigma$ (abbreviated $\approx$) with sort $\sigma \times \sigma \to \mathsf{Bool}$ for
all $\sigma$ in $\Sigma$. A theory is a pair $T = (\Sigma, \mathbf{I})$, where $\Sigma$ is a signature and $\mathbf{I}$ is a class
of $\Sigma$-interpretations. A $\Sigma$-formula $\phi$ is satisfiable (resp., unsatisfiable) in $T$ if it
is satisfied by some (resp., no) interpretation in $\mathbf{I}$. Given a (set of) formula(s) $S$,
we write $S \models_T \phi$ if every interpretation $M \in \mathbf{I}$ that satisfies $S$ also satisfies $\phi$.

When using the CDCL(T) framework for SMT, the reasoning engine for each
theory is encapsulated inside a theory solver. Here, we mention the fragment of
CDCL(T) that is relevant for our purposes ([78] gives a complete presentation).

The goal of CDCL(T) is to check a formula $\phi$ for satisfiability. A core module
manages a propositional search over the propositional abstraction of $\phi$ and
communicates with the theory solver. As the core constructs partial propositional
assignments for the abstract formula, the theory solver is given the literals
that correspond to the current propositional assignment. When the propositional
assignment is completed (or, optionally, before), the theory solver must determine
whether its literals are jointly satisfiable. If so, it must be able to provide
an interpretation in $\mathbf{I}$ (which includes an assignment to theory variables) that
satisfies them. If not, it may indicate a strict subset of the literals which are
unsatisfiable: an unsatisfiable core. Smaller unsatisfiable cores usually accelerate
the propositional search.


3 The Theory of Finite Fields 


We define the theory $T_{\mathbb{F}_q}$ of the finite field $\mathbb{F}_q$, for any order $q$. Its sort and
symbols are indexed by the parameter $q$; we omit $q$ when clear from context.

The signature of the theory is given in Fig. 1. It includes sort F, which intuitively
denotes the sort of elements of $\mathbb{F}_q$ and is represented in our proposed
SMT-LIB format as (_ FiniteField q). There is a constant symbol for each
element of $\mathbb{F}_q$, and function symbols for addition and multiplication. Other finite
field operations (e.g., negation, subtraction, and inverses) naturally reduce to this
signature.

An interpretation $M$ of $T_{\mathbb{F}_q}$ must interpret: F as $\mathbb{F}_q$, $n \in \{0, \dots, q - 1\}$
as the $n$th element of $\mathbb{F}_q$ in lexicographical order,³ $+$ as addition in $\mathbb{F}_q$, $\times$ as
multiplication in $\mathbb{F}_q$, and $\approx$ as equality in $\mathbb{F}_q$.

Note that in order to avoid ambiguity, we require that the sort of any constant
ffn must be ascribed. For instance, the $n$th element of $\mathbb{F}_q$ would be (as ffn
(_ FiniteField q)). The sorts of non-nullary function symbols need not be
ascribed: they can be inferred from their arguments.
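
To illustrate the concrete syntax, the sketch below builds term strings following the sort, constant, and function symbols named above. The Python helper names are hypothetical and only format text; they are not part of cvc5 or of any SMT-LIB tooling.

```python
def ff_sort(q):
    """The sort of elements of the field of order q."""
    return f"(_ FiniteField {q})"


def ff_const(n, q):
    """The n-th field element, with its sort ascribed as the text requires."""
    if not 0 <= n < q:
        raise ValueError("constant index must lie in {0, ..., q - 1}")
    return f"(as ff{n} {ff_sort(q)})"


def ff_add(lhs, rhs):
    return f"(ff.add {lhs} {rhs})"


def ff_mul(lhs, rhs):
    return f"(ff.mul {lhs} {rhs})"


# The term x * 3 + y over the field of order 7, for variables named x and y.
print(ff_add(ff_mul("x", ff_const(3, 7)), "y"))
```

Only the constants need the sort ascription; the sorts of the `ff.add` and `ff.mul` applications follow from their arguments.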


³ For non-prime $\mathbb{F}_{p^e}$, we use the lexicographical ordering of elements represented as
polynomials in $\mathbb{F}_p[X]$ modulo the Conway polynomial [83,90] $C_{p,e}(X)$. This representation
is standard [57].


Symbol                        Arity        SMT-LIB   Description
$n \in \{0, \dots, q - 1\}$   F            ffn       The $n$th element of $\mathbb{F}_q$
$+$                           F × F → F    ff.add    Addition in $\mathbb{F}_q$
$\times$                      F × F → F    ff.mul    Multiplication in $\mathbb{F}_q$


Fig. 1. Signature of the theory of $\mathbb{F}_q$


1   Function DecisionProcedure:
      Input: A set of F-literals $L$ in variables $X$
      Output: UNSAT and a core $C \subseteq L$, or
      Output: SAT and a model $M : X \to \mathbb{F}$
2     $P \leftarrow$ empty set;
3     $W_i \leftarrow$ fresh, $\forall i$;
4     for $s_i \bowtie_i t_i \in L$ do
5       if $\bowtie_i$ = $\approx$ then $P \leftarrow P \cup \{[s_i] - [t_i]\}$;
6       else if $\bowtie_i$ = $\not\approx$ then $P \leftarrow P \cup \{W_i([s_i] - [t_i]) - 1\}$;
7     $B \leftarrow$ GB($P$);
8     if $1 \Rightarrow_B 0$ then return UNSAT, CoreFromTree();
9     $m \leftarrow$ FindZero($P$);
10    if $m = \bot$ then return UNSAT, $L$;
11    else return SAT, $\{X \mapsto z : (X \mapsto z) \in m, X \in X\}$;


Fig. 2. The decision procedure for $\mathbb{F}_q$.


4 Decision Procedure 


Recall (§2.4) that a CDCL(T) theory solver for F must decide the satisfiability of
a set of F-literals. At a high level, our decision procedure comprises three steps.
First, we reduce to a problem concerning a single algebraic variety. Second, we
use a GB-based test for unsatisfiability that is fast and sound, but incomplete.
Third, we attempt model construction. Figure 2 shows pseudocode for the decision
procedure; we will explain it incrementally.


4.1 Algebraic Reduction 


Let $L = \{\ell_1, \dots, \ell_{|L|}\}$ be a set of literals. Each F-literal has the form $\ell_i = s_i \bowtie_i t_i$
where $s$ and $t$ are F-terms and $\bowtie \in \{\approx, \not\approx\}$. Let $X = \{X_1, \dots, X_k\}$ denote the
free variables in $L$. Let $E, D \subseteq \{1, \dots, |L|\}$ be the sets of indices corresponding to
equalities and disequalities in $L$, respectively. Let $[t] \in \mathbb{F}[X]$ denote the natural
interpretation of F-terms as polynomials in $\mathbb{F}[X]$ (Fig. 3). Let $P_E \subset \mathbb{F}[X]$ be the
set of interpretations of the equalities; i.e., $P_E = \{[s_i] - [t_i]\}_{i \in E}$. Let $P_D \subset
\mathbb{F}[X]$ be the interpretations of the disequalities; i.e., $P_D = \{[s_i] - [t_i]\}_{i \in D}$. The
satisfiability of $L$ reduces to whether $\mathcal{V}(\langle P_E \rangle) \setminus [\bigcup_{p \in P_D} \mathcal{V}(\langle p \rangle)]$ is non-empty.

To simplify, we reduce disequalities to equalities using a classic technique [88]:
we introduce a fresh variable $W_i$ for each $i \in D$ and define $P'_D$ as


\[ P'_D = \{W_i([s_i] - [t_i]) - 1\}_{i \in D} \]

Fig. 3. Interpreting F-terms as polynomials


Note that each $p \in P'_D$ has zeros for exactly the values of $X$ where its analog in
$P_D$ is not zero. Also note that $P'_D \subset \mathbb{F}_q[X']$, with $X' = X \cup \{W_i\}_{i \in D}$.

We define $P$ to be $P_E \cup P'_D$ (constructed in lines 2 to 6, Fig. 2) and note three
useful properties of $P$. First, $L$ is satisfiable if and only if $\mathcal{V}(\langle P \rangle)$ is non-empty.
Second, for any $P' \subset P$, if $\mathcal{V}(\langle P' \rangle) = \emptyset$, then $\{\pi(p) : p \in P'\}$ is an unsatisfiable
core, where $\pi$ maps a polynomial to the literal it is derived from. Third, from
any $x \in \mathcal{V}(\langle P \rangle)$ one can immediately construct a model. Thus, our theory solver
reduces to understanding properties of the variety $\mathcal{V}(\langle P \rangle)$.
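
The first property can be checked by brute force on toy inputs. In the sketch below, each literal is a pair of a function computing $[s] - [t]$ and a flag saying whether the literal is an equality; this representation, the function names, and the field of order 5 are assumptions made for illustration.

```python
from itertools import product


def literals_satisfiable(lits, nvars, p):
    """Directly search F_p^nvars for an assignment satisfying all literals."""
    return any(all((f(xs) % p == 0) == is_eq for f, is_eq in lits)
               for xs in product(range(p), repeat=nvars))


def variety_nonempty(lits, nvars, p):
    """Search for a common zero after replacing s != t by W*(s - t) - 1 = 0."""
    n_diseq = sum(1 for _, is_eq in lits if not is_eq)
    for xs in product(range(p), repeat=nvars):
        for ws in product(range(p), repeat=n_diseq):
            w = iter(ws)  # one fresh variable W_i per disequality, in order
            if all((f(xs) if is_eq else next(w) * f(xs) - 1) % p == 0
                   for f, is_eq in lits):
                return True
    return False


sat_lits = [(lambda v: v[0] * v[1] - 1, True),   # X1 * X2 = 1
            (lambda v: v[0] - 2, False)]         # X1 != 2
unsat_lits = [(lambda v: v[0] - 2, True),        # X1 = 2
              (lambda v: v[0] - 2, False)]       # X1 != 2

print(literals_satisfiable(sat_lits, 2, 5), variety_nonempty(sat_lits, 2, 5))
print(literals_satisfiable(unsat_lits, 1, 5), variety_nonempty(unsat_lits, 1, 5))
```

The reduction works because a non-zero field element always has an inverse to serve as the value of $W_i$, while zero has none.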


4.2 Incomplete Unsatisfiability and Cores 


Recall (§2.2) that if $1 \in \langle P \rangle$, then $\mathcal{V}(\langle P \rangle)$ is empty. We can answer this ideal
membership query using a Gröbner basis engine (line 7, Fig. 2). Let GB be a
subroutine that takes a list of polynomials and computes a Gröbner basis for the
ideal that they generate, according to some monomial ordering. We use grevlex:
the ordering for which GB engines are typically most efficient [42]. We compute
GB($P$) and check whether $1 \Rightarrow_{\mathrm{GB}(P)} 0$. If so, we report that $\mathcal{V}(\langle P \rangle)$ is empty. If
not, recall (§2.2) that $\mathcal{V}(\langle P \rangle)$ may still be empty; we proceed to attempt model
construction (lines 9 to 11, Fig. 2, described in the next subsection).

If 1 does reduce by the Gröbner basis, then identifying a subset of $P$ which
is sufficient to reduce 1 yields an unsatisfiable core. To construct such a subset,
we formalize the inferences performed by the Gröbner basis engine as a calculus
for proving ideal membership.

Figure 4 presents IdealCalc: our ideal membership calculus. IdealCalc proves
facts of the form $p \in \langle P \rangle$, where $p$ is a polynomial and $P$ is the set of generators
for an ideal. The G rule states that the generators are in the ideal. The Z rule
states that 0 is in the ideal. The S rule states that for any two polynomials in
the ideal, their s-polynomial is in the ideal too. The R↑ and R↓ rules state that
if $p \to_q r$ with $q$ in the ideal, then $p$ is in the ideal if and only if $r$ is.

The soundness of IdealCalc follows immediately from the definition of an ideal.
Completeness relies on the existence of algorithms for computing Gröbner bases
using only s-polynomials and reduction [21,41,43]. We prove both properties in
Appendix A.


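Theorem 1 (IdealCalcSoundness). If there exists an IdealCalc proof tree with
conclusion $p \in \langle P \rangle$, then $p \in \langle P \rangle$.


Theorem 2 (IdealCalcCompleteness). If $p \in \langle P \rangle$, then there exists an
IdealCalc proof tree with conclusion $p \in \langle P \rangle$.


Rule   Premises                                                             Conclusion
Z      (none)                                                               $0 \in \langle P \rangle$
G      $p \in P$                                                            $p \in \langle P \rangle$
R↑     $r \in \langle P \rangle$, $q \in \langle P \rangle$, $p \to_q r$    $p \in \langle P \rangle$
S      $p \in \langle P \rangle$, $q \in \langle P \rangle$                 $\mathrm{spoly}(p, q) \in \langle P \rangle$
R↓     $p \in \langle P \rangle$, $q \in \langle P \rangle$, $p \to_q r$    $r \in \langle P \rangle$


Fig. 4. IdealCalc: a calculus for ideal membership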


1   Function FindZero:
      Input: A Gröbner basis $B \subset \mathbb{F}[X']$
      Input: A partial map $M : X' \to \mathbb{F}$ (empty by default)
      Output: A total map $M : X' \to \mathbb{F}$ or $\bot$
2     if $1 \in \langle B \rangle$ then return $\bot$;
3     if $|M| = |X'|$ then return $M$;
4     for $(X'_i \mapsto z) \in$ ApplyRule($B$, $M$) do
5       $r \leftarrow$ FindZero(GB($B \cup \{X'_i - z\}$), $M \cup \{X'_i \mapsto z\}$);
6       if $r \ne \bot$ then return $r$;
7     return $\bot$


Fig. 5. Finding common zeros for a Gröbner basis. After handling trivial cases,
FindZero uses ApplyRule to apply the first applicable rule from Fig. 6.


By instrumenting a Gröbner basis engine and reduction engine, one can construct
IdealCalc proof trees. Then, for a conclusion $1 \in \langle P \rangle$, traversing the proof
tree to its leaves gives a subset $P' \subset P$ such that $1 \in \langle P' \rangle$. The procedure
CoreFromTree (called in line 8, Fig. 2) performs this traversal, by accessing a
proof tree recorded by the GB procedure and the reductions. The proof of Theorem 2
explains our instrumentation in more detail (Appendix A).


4.3 Completeness Through Model Construction 


As discussed, we still need a complete decision procedure for determining if
$\mathcal{V}(\langle P \rangle)$ is empty. We call this procedure FindZero; it is a backtracking search
for an element of $\mathcal{V}(\langle P \rangle)$. It also serves as our model construction procedure.

Figure 5 presents FindZero as a recursive search. It maintains two data structures:
a Gröbner basis $B$ and partial map $M : X' \to \mathbb{F}$ from variables to field
elements. By applying a branching rule (which we will discuss in the next paragraph),
FindZero obtains a disjunction of single-variable assignments $X'_i \mapsto z$,
which it branches on. FindZero branches on an assignment $X'_i \mapsto z$ by adding it
to $M$ and updating $B$ to GB($B \cup \{X'_i - z\}$).

Figure 6 shows the branching rules of FindZero. Each rule comprises
antecedents (conditions that must be met for the rule to apply) and a conclusion
(a disjunction of single-variable assignments to branch on). The Univariate rule
applies when $B$ contains a polynomial $p$ that is univariate in some variable $X'_i$
that $M$ does not have a value for. The rule branches on the univariate roots
of $p$. The Triangular rule comes from work on triangular decomposition [68]. It
applies when $B$ is zero-dimensional.⁴ It computes a univariate minimal polynomial
$p(X'_i)$ in some unassigned variable $X'_i$, and branches on the univariate
roots of $p$. The final rule Exhaust has no conditions and simply branches on all
possible values for all unassigned variables.


Univariate
  Antecedents: $p \in B$, $p \in \mathbb{F}[X'_i]$, $X'_i \notin M$, $Z \leftarrow$ UnivariateZeros($p$)
  Conclusion:  $\bigvee_{z \in Z} (X'_i \mapsto z)$

Triangular
  Antecedents: Dim($\langle B \rangle$) $= 0$, $X'_i \notin M$, $p \leftarrow$ MinPoly($B$, $X'_i$), $Z \leftarrow$ UnivariateZeros($p$)
  Conclusion:  $\bigvee_{z \in Z} (X'_i \mapsto z)$

Exhaust
  Antecedents: (none)
  Conclusion:  $\bigvee_{z \in \mathbb{F}} \bigvee_{X'_i \notin M} (X'_i \mapsto z)$


Fig. 6. Branching rules for FindZero.

FindZero’s ApplyRule sub-routine applies the first rule in Fig. 6 whose conditions
are met. The other subroutines (GB [21,41,43], Dim [11], MinPoly [2], and
UnivariateZeros [87]) are commonly implemented in computer algebra libraries.
Dim, MinPoly, and UnivariateZeros run in (randomized) polytime.


Theorem 3 (FindZeroCorrectness). If $\mathcal{V}(\langle B \rangle) = \emptyset$ then FindZero returns $\bot$;
otherwise, it returns a member of $\mathcal{V}(\langle B \rangle)$. (Proof: Appendix B)


Correctness and Efficiency. The branching rules achieve a careful balance
between correctness and efficiency. The Exhaust rule is always applicable, but
a full exhaustive search over a large field is unreasonable (recall: ZKPs operate
over ~255-bit fields). The Triangular and Univariate rules are important alternatives
to exhaustion. They create a far smaller set of branches, but apply only when
the variety has dimension zero or the basis has a univariate polynomial.

As an example of the importance of Univariate, consider the univariate system
$X^2 = 2$, in a field where 2 is not a perfect square (e.g., $\mathbb{F}_5$). $X^2 - 2$ is already a
(reduced) Gröbner basis, and it does not contain 1, so FindZero applies. With
the Univariate rule, FindZero computes the univariate zeros of $X^2 - 2$ (there are
none) and exits. Without it, the Exhaust rule creates $|\mathbb{F}|$ branches.
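
Whether 2 is a perfect square depends on the field, which a brute-force root search makes visible. The sketch below is illustrative only: `univariate_zeros` is a hypothetical stand-in for the UnivariateZeros subroutine (real implementations do not enumerate the field), and the two small fields are chosen for the example. Over the field of order 5 there are no roots, whereas over the field of order 7 the roots are 3 and 4.

```python
def univariate_zeros(coeffs, p):
    """Brute-force zeros in F_p of sum(coeffs[i] * X**i)."""
    return [x for x in range(p)
            if sum(c * pow(x, i, p) for i, c in enumerate(coeffs)) % p == 0]


def branch_counts(coeffs, p):
    """Branches created by Univariate (one per root) versus Exhaust (one per element)."""
    return len(univariate_zeros(coeffs, p)), p


x_squared_minus_2 = [-2, 0, 1]
print(univariate_zeros(x_squared_minus_2, 5), branch_counts(x_squared_minus_2, 5))
print(univariate_zeros(x_squared_minus_2, 7), branch_counts(x_squared_minus_2, 7))
```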

As an example of when Triangular is critical, consider

\[
\begin{aligned}
X_1 + X_2 + X_3 + X_4 + X_5 &= 0 \\
X_1X_2 + X_2X_3 + X_3X_4 + X_4X_5 + X_5X_1 &= 0 \\
X_1X_2X_3 + X_2X_3X_4 + X_3X_4X_5 + X_4X_5X_1 + X_5X_1X_2 &= 0 \\
X_1X_2X_3X_4 + X_2X_3X_4X_5 + X_3X_4X_5X_1 + X_4X_5X_1X_2 + X_5X_1X_2X_3 &= 0 \\
X_1X_2X_3X_4X_5 &= 1
\end{aligned}
\]

in $\mathbb{F}_{394357}$ [68]. The system is unsatisfiable, it has dimension 0, and its ideal
does not contain 1. Moreover, our solver computes a (reduced) Gröbner basis
for it that does not contain any univariate polynomials. Thus, Univariate does
not apply. However, Triangular does, and with it, FindZero quickly terminates.
Without Triangular, Exhaust would create at least $|\mathbb{F}|$ branches.


⁴ The dimension of an ideal is a natural number that can be efficiently computed from
a Gröbner basis. If the dimension is zero, then one can efficiently compute a minimal
polynomial in any variable $X$, given a Gröbner basis [2,68].

In the above examples, Exhaust performs very poorly. However, that is not
always the case. For example, in the system $X_1 + X_2 = 0$, using Exhaust to guess
$X_1$, and then using the Univariate rule to determine $X_2$ is quite reasonable. In
general, Exhaust is a powerful tool for solving underconstrained systems. Our
experiments will show that despite including Exhaust, our procedure performs
quite well on our benchmarks. We reflect on its performance in Sect. 8.
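
The flavor of the backtracking search can be conveyed by a much simplified sketch that uses only Exhaust-style branching. It is not the FindZero procedure of Fig. 5: there is no Gröbner basis recomputation and no Univariate or Triangular rule, and pruning is done by evaluating polynomials whose variables are all assigned. The representation (pairs of a last-variable index and an evaluation function) and the small fields are assumptions made for illustration.

```python
def find_zero(polys, nvars, p, partial=()):
    """Search for a common zero in F_p^nvars; return a tuple of values or None.

    polys is a list of (last_var, f) pairs: f takes the tuple of values
    assigned so far and only reads indices <= last_var.
    """
    # Prune as soon as a fully assigned polynomial is non-zero.
    for last_var, f in polys:
        if last_var < len(partial) and f(partial) % p != 0:
            return None
    if len(partial) == nvars:
        return partial
    # Exhaust-style branching: try every field element for the next variable.
    for z in range(p):
        result = find_zero(polys, nvars, p, partial + (z,))
        if result is not None:
            return result
    return None


# X1 + X2 = 0 over F_7: guessing X1 leaves exactly one consistent X2.
print(find_zero([(1, lambda v: v[0] + v[1])], 2, 7))
# X^2 - 2 = 0 over F_5 has no solution, so every branch fails.
print(find_zero([(0, lambda v: v[0] ** 2 - 2)], 1, 5))
```

On an underconstrained system the first guess often succeeds, which matches the observation that exhaustion is reasonable there; on a 255-bit field with no solution, the loop over all field elements is hopeless, which is why the other rules matter.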


Field Polynomials: A Road not Taken. By guaranteeing completeness through
(potential) exhaustion, we depart from prior work. Typically, one ensures completeness
by including field polynomials in the ideal (§2.2). Indeed, this is the
approach suggested [97] and taken [55] by prior work. However, field polynomials
induce enormous overhead in the Gröbner basis engine because their degree is
so large. The result is a procedure that is only efficient for tiny fields [55]. In
our experiments, we compare our system’s performance to what it would be if
it used field polynomials.⁵ The results confirm that deferring completeness to
FindZero is far superior for our benchmarks.


5 Implementation 


We have implemented our decision procedure for prime fields in the cvc5 SMT
solver [7] as a theory solver. It is exposed through cvc5’s SMT-LIB, C++, Java,
and Python interfaces. Our implementation comprises ~2k lines of C++. For the
algebraic sub-routines of our decision procedure (§4), it uses CoCoALib [1]. To
compute unsatisfiable cores (§4.2), we inserted hooks into CoCoALib’s Gröbner
basis engine (17 lines of C++).

Our theory solver makes sparse use of the interface between it and the rest 
of the SMT solver. It acts only once a full propositional assignment has been 
constructed. It then runs the decision procedure, reporting either satisfiability 
(with a model) or unsatisfiability (with an unsatisfiable core). 


6 Benchmark Generation 


Recall that one motivation for this work is to enable translation validation for
compilers to field constraint systems (R1CSs) used in zero-knowledge proofs
(ZKPs). Our benchmarks are SMT formulas that encode translation validation
queries for compilers from Boolean computations to R1CS. At a high level, each
benchmark is generated as follows.


1. Sample a Boolean formula $\Psi$ in $v$ variables with $t$ non-variable terms.
2. Compile $\Psi$ to R1CS using ZoKrates [36], CirC [81], or ZoK-CirC [81].
3. Optionally remove some constraints from the R1CS.
4. Construct a formula $\phi$ in QF_FF that tests the soundness (all assignments satisfying
the R1CS agree with $\Psi$) or determinism (the inputs uniquely determine
the output) of the R1CS.
5. Optionally encode $\phi$ in QF_BV, in QF_NIA, or as (Boolean-free) F-equations.


⁵ We add field polynomials to our procedure on line 2, Fig. 2. This renders our ideal
triviality test (lines 7 and 8) complete, so we can eliminate the fallback to FindZero.


Through step 3, we construct SMT queries that are satisfiable, unsatisfiable, and 
of unknown status. Through step 5, we construct queries solvable using bit-vector 
reasoning, integer reasoning, or a stand-alone computer algebra system. 


6.1 Examples 


We describe our benchmark generator in full and give the definitions of soundness
and determinism in Appendix C. Here, we give three example benchmarks. Our
examples are based on the Boolean formula $\Psi(x_1, x_2, x_3, x_4) = x_1 \lor x_2 \lor x_3 \lor x_4$.
Our convention is to mark field variables with a prime, but not Boolean
variables. Using the technique from Sect. 2.3, CirC compiles this formula to the
two-constraint system: $i's' = r' \land (1 - r')s' = 0$ where $s' \triangleq \sum_i x'_i$. Each Boolean
input $x_i$ corresponds to field element $x'_i$ and $r'$ corresponds to the result of $\Psi$.


Soundness. An R1CS is sound if it ensures the output $r'$ corresponds to the value
of $\Psi$ (when given valid inputs). Concretely, our system is sound if the following
formula is valid:

\[
\underbrace{\forall i.\, (x'_i = 0 \lor x'_i = 1) \land (x'_i = 1 \iff x_i)}_{\text{inputs are correct}}
\;\land\;
\underbrace{i's' = r' \land (1 - r')s' = 0}_{\text{constraints hold}}
\;\implies\;
\underbrace{(r' = 0 \lor r' = 1) \land (r' = 1 \iff \Psi)}_{\text{output is correct}}
\]

where $\Psi$ and $s'$ are defined as above. This is an UNSAT benchmark, because the
formula is valid.


Determinism. An R1CS is deterministic if the values of the inputs uniquely
determine the value of the output. To represent this in a formula, we use two
copies of the constraint system: one with primed variables, and one with double-primed
variables. Our example is deterministic if the following formula is valid:

\[
\underbrace{\forall i.\, x'_i = x''_i}_{\text{inputs agree}}
\;\land\;
\underbrace{i's' = r' \land (1 - r')s' = 0 \land i''s'' = r'' \land (1 - r'')s'' = 0}_{\text{constraints hold for both systems}}
\;\implies\;
\underbrace{r' = r''}_{\text{outputs agree}}
\]

Unsoundness. Removing constraints from the system can give a formula that is
not valid (a SAT benchmark). For example, if we remove $(1 - r')s' = 0$, then the
soundness formula is falsified by $\{x_i \mapsto \top, x'_i \mapsto 1, r' \mapsto 0, i' \mapsto 0\}$.
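
The validity of the soundness formula, and its failure once a constraint is dropped, can be reproduced by searching for falsifying assignments over a small field. This brute-force sketch stands in for the SMT query; the function name, the parameter `keep_second`, and the field of order 7 (large enough that the sum of four inputs cannot overflow) are assumptions made for the example.

```python
from itertools import product


def soundness_counterexamples(p, keep_second=True, n_inputs=4):
    """Assignments (inputs, i', r') satisfying the constraints with a wrong output."""
    found = []
    for xs in product((0, 1), repeat=n_inputs):   # valid field inputs x'_i
        s = sum(xs) % p
        for inv, r in product(range(p), repeat=2):
            holds = (inv * s - r) % p == 0
            if keep_second:
                holds = holds and ((1 - r) * s) % p == 0
            output_ok = r in (0, 1) and (r == 1) == any(xs)
            if holds and not output_ok:
                found.append((xs, inv, r))
    return found


# With both constraints there is no falsifying assignment (an UNSAT benchmark).
print(len(soundness_counterexamples(7)))
# Without (1 - r')s' = 0, all-true inputs with i' = 0 and r' = 0 falsify soundness.
print(((1, 1, 1, 1), 0, 0) in soundness_counterexamples(7, keep_second=False))
```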


7 Experiments 


Our experiments show that our approach:


1. scales well with the size of F (unlike a BV-based approach),
2. would scale poorly with the size of F if field polynomials were used,
3. benefits from unsatisfiable cores, and
4. substantially outperforms all reasonable alternatives.


Our test bed is a cluster with Intel Xeon E5-2637 v4 CPUs. Each run is 
limited to one physical core, 8GB memory, and 300s. 

Throughout, we generate benchmarks for two correctness properties (soundness
and determinism), three different ZKP compilers, and three different statuses
(sat, unsat, and unknown). We vary the field size, encoding, number of
inputs, and number of terms, depending on the experiment. We evaluate our
cvc5 extension, Bitwuzla (commit 27f6291), and z3 (version 4.11.2).


[Figure 7: two plots comparing the systems bv-bitwuzla, bv-cvc5, bv-z3, and ff-cvc5.
(a) Instances solved (x-axis: solved instances). (b) Total solve time for (field-based) cvc5
and (BV-based) Bitwuzla on commonly solved instances at all bit-widths (x-axis: bits).]


Fig. 7. The performance of field-based and BV-based approaches (with various BV
solvers) when the field size ranges from 5 to 60 bits.


7.1 Comparison with Bit-Vectors 


Since bit-vector solvers scale poorly with bit-width, one would expect the effec- 
tiveness of a BV encoding of our properties to degrade as the field size grows. To 
validate this, we generate BV-encoded benchmarks for varying bit-widths and 
evaluate state-of-the-art bit-vector solvers on them. Though our applications of 
interest use b = 255, we will see that the BV-based approach does not scale to
fields this large. Thus, for this set of experiments we use b ∈ {5, 10, ..., 60}, and
we sample formulas with 4 inputs and 8 intermediate terms.

Table 1. Solved small-field benchmarks by tool, property, and status.

                    determinism           soundness
system           unsat  unk.  sat     unsat  unk.  sat    timeout  memout  total solved
bv-bitwuzla          4    16   29        28    32   36         71       0           145
bv-cvc5              5    11   36        25    25   29         78       7           131
bv-z3                5     9   14        25    25   29        100       9           107
ff-cvc5             36    36   36        36    36   36          0       0           216
all benchmarks      36    36   36        36    36   36                              216

Figure 7a shows performance of three bit-vector solvers (cvc5 [7], Bitwu- 
zla [76], and z3 [73]) and our F solver as a cactus plot; Table 1 splits the solved 
instances by property and status. We see that even for these small bit-widths, 
the field-based approach is already superior. The bit-vector solvers are more 
competitive on the soundness benchmarks, since these benchmarks include only 
half as many field operations as the determinism benchmarks. 
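For readers less familiar with cactus plots, the following sketch shows one common way to derive the plotted points from per-instance solve times. The convention (the k-th point is the time of the k-th fastest solved instance), the function name, and the use of `None` for unsolved runs are assumptions made for illustration; they are not taken from the paper's tooling.

```python
from __future__ import annotations
from typing import Optional

def cactus_points(times: list[Optional[float]], limit: float = 300.0) -> list[tuple[int, float]]:
    """Return (number of solved instances, solve time) pairs for a cactus plot.

    `times` holds one entry per benchmark; None marks a run with no answer.
    Runs slower than `limit` seconds are treated as unsolved.
    """
    solved = sorted(t for t in times if t is not None and t <= limit)
    # The k-th point records the time needed by the k-th fastest instance.
    return [(k, t) for k, t in enumerate(solved, start=1)]
```

A solver whose curve extends further along the solved-instances axis solved more benchmarks within the limit.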

For our benchmarks, Bitwuzla is the most efficient BV solver. We further 
examine the time that it and our solver take to solve the 9 benchmarks they can 
both solve at all bit-widths. Figure 7b plots the total solve time against b. While 
the field-based solver’s runtime is nearly independent of field size, the bit-vector 
solvers slow down substantially as the field grows. 

In sum, the BV approach scales poorly with field size and is already inferior 
on fields of size at least 27°. 


7.2 The Cost of Field Polynomials 


Recall that our decision procedure does not use field polynomials (§4.3), but our 
implementation optionally includes them (§5). In this experiment, we measure 
the cost they incur. We use propositional formulas in 2 variables with 4 terms, 
and we take b ∈ {4, ..., 12}, and include SAT and unknown benchmarks.

Figure 8a compares the performance of our tool with and without field poly- 
nomials. For many benchmarks, field polynomials cause a slowdown greater than 
100x. To better show the effect of the field size, we consider the solve time 
for the SAT benchmarks, at varying values of b. Figure 8b shows how solve
times change as b grows: using field polynomials causes exponential growth. For 
UNSAT benchmarks, both configurations complete within 1s. This is because 
(for these benchmarks) the GB is just {1} and CoCoA’s GB engine is good at 
discovering that (and exiting) without considering the field polynomials. 

This growth is predictable. GB engines can take time exponential (or worse)
in the degree of their inputs. A simple example illustrates this fact: consider
computing a Gröbner basis with X^{2^b} − X and X^2 − X. The former reduces to
0 modulo the latter, but the reduction takes 2^b − 1 steps.
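The step count in this example can be checked with a small sketch. It assumes the higher-degree polynomial is X^(2^b) − X and performs one leading-term cancellation per step, which is a simplification of what a real Gröbner basis engine does; the function name and the sparse dictionary representation are illustrative choices.

```python
from __future__ import annotations

def reduction_steps(b: int) -> tuple[int, bool]:
    """Reduce X^(2^b) - X modulo X^2 - X, cancelling one leading term per step.

    Returns the number of steps and whether the final remainder is zero.
    """
    if b < 1:
        raise ValueError("b must be at least 1")
    poly = {2 ** b: 1, 1: -1}  # sparse map: degree -> integer coefficient
    steps = 0
    while poly and max(poly) >= 2:
        lead = max(poly)
        coeff = poly.pop(lead)
        # Subtracting coeff * X^(lead-2) * (X^2 - X) cancels X^lead
        # and adds coeff * X^(lead-1).
        new = poly.get(lead - 1, 0) + coeff
        if new:
            poly[lead - 1] = new
        else:
            poly.pop(lead - 1, None)
        steps += 1
    return steps, not poly
```

Each step lowers the leading degree by exactly one, so the number of steps doubles whenever b grows by one.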
Fig. 8. Solve times, with and without field polynomials. The field size varies from 4 to
12 bits. The benchmarks are all SAT or unknown.

Fig. 9. The performance of alternative algebra-based approaches.


7.3 The Benefit of UNSAT Cores 


Section 4.2 describes how we compute unsatisfiable (UNSAT) cores in the F 
solver by instrumenting our Gröbner basis engine. In this experiment, we mea- 
sure the benefit of doing so. We generate Boolean formulas with 2, 4, 6, 8, 10, 
and 12 variables; and 2^0, 2^1, 2^2, 2^3, 2^4, 2^5, 2^6, and 2^7 intermediate terms, for
a 255-bit field. We vary the number of intermediate terms widely in order to 
generate benchmarks of widely variable difficulty. We configure our solver with 
and without GB instrumentation. 

Figure 9a shows the results. For many soundness benchmarks, the cores cause
a speedup of more than 10x. As expected, only the soundness benchmarks benefit.
Soundness benchmarks have non-trivial Boolean structure, so the SMT core
makes many queries to the theory solver. Returning good UNSAT cores shrinks
the propositional search space, reduces the number of theory queries, and thus
reduces solve time. However, determinism benchmarks are just a conjunction
of theory literals, so the SMT core makes only one theory query. For them,
returning a good UNSAT core has no benefit, but also induces little overhead.

Fig. 10. A comparison of all approaches.


7.4 Comparison to Pure Computer Algebra 


In this experiment, we compare our SMT-based approach (which inte- 
grates computer-algebra techniques into SMT) against a stand-alone use of 
computer-algebra. We encode the Boolean structure of our formulas in F, (see 
Appendix C). When run on such an encoding, our SMT solver makes just one 
query to its field solver, so it cannot benefit from the search optimizations present 
in CDCL(T). For this experiment, we use the same benchmark set as the last. 
Figure 9b compares the pure F approach with our SMT-based approach.
For benchmarks that encode soundness properties, the SMT-based approach is
clearly dominant. The intuition here is that computer algebra systems are not
optimized for Boolean reasoning. If a problem has non-trivial Boolean structure,
a cooperative approach like SMT has clear advantages. SMT’s advantage is less
pronounced for determinism benchmarks, as these manifest as a single query to
the finite field solver; still, in this case, our encoding seems to have some benefit
much of the time.


7.5 Main Experiment 


In our main experiment, we compare our approach against all reasonable alter- 
natives: a pure computer-algebra approach (§7.4), a BV approach with Bitwuzla 
(the best BV solver on our benchmarks, §7.1), an NIA approach with cvc5 and 
z3, and our own tool without UNSAT cores (§7.3). We use the same benchmark 
set as the last experiment; this uses a 255-bit field. 

Figure 10 shows the results as a cactus plot. Table 2 shows the number of
solved instances for each system, split by property and status. Bitwuzla quickly
runs out of memory on most of the benchmarks. A pure computer-algebra approach
outperforms Bitwuzla and cvc5’s NIA solver. The NIA solver of z3 does a
bit better, but our field-aware SMT solver is the best by far. Moreover, its best
configuration uses UNSAT cores. Comparing the total solve time of ff-cvc5 and
nia-z3 on commonly solved benchmarks, we find that ff-cvc5 reduces total solve
time by 6x. In sum, the techniques we describe in this paper yield a tool that
substantially outperforms all alternatives on our benchmarks.

Table 2. Solved benchmarks by tool, property, and status.

                    determinism           soundness
system           unsat  unk.  sat     unsat  unk.  sat    timeout  memout  total solved
bv-bitwuzla          7     8   16        34    52   52        127     568           169
ff-cvc5             94    78   78       135   137  137        168      37           659
ff-cvc5-nocore      94    78   78       123   125  136        193      37           634
nia-cvc5             1    29   41         8    25   46        714       0           150
nia-z3               2    30   55        66    70   73        568       0           296
pureff-cvc5         84    74   75         6    15   10        532      68           264
all benchmarks     144   144  144       144   144  144                              864


8 Discussion and Future Work 


We’ve presented a basic study of the potential of an SMT theory solver for finite 
fields based on computer algebra. Our experiments have focused on translation 
validation for ZKP compilers, as applied to Boolean input computations. The 
solver shows promise, but much work remains. 

As discussed (Sect. 5), our implementation makes limited use of the interface 
exposed to a theory solver for CDCL(T). It does no work until a full propositional 
assignment is available. It also submits no lemmas to the core solver. Exploring 
which lightweight reasoning should be performed during propositional search 
and what kinds of lemmas are useful is a promising direction for future work. 

Our model construction (Sect. 4.3) is another weakness. Without univariate 
polynomials or a zero-dimensional ideal, it falls back to exhaustive search. If a 
solution over an extension field is acceptable, then there are O(|F|%) solutions, 
so an exhaustive search seems likely to quickly succeed. Of course, we need a 
solution in the base field. If the base field is closed, then every solution is in the 
base field. Our fields are finite (and thus, not closed), but for our benchmarks, 
they seem to bear some empirical resemblance to closed fields (e.g., the GB-based 
test for an empty variety never fails, even though it is theoretically incomplete). 
For this reason, exhaustive search may not be completely unreasonable for our 
benchmarks. Indeed, our experiments show that our procedure is effective on our 
benchmarks, including for SAT instances. However, the worst-case performance 
of this kind of model construction is clearly abysmal. We think that a more 
intelligent search procedure and better use of ideas from computer algebra [6, 67] 
would both yield improvement. 
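To make the cost of the exhaustive fallback concrete, the following sketch searches for a common zero of a polynomial system over a small prime field by brute force. It is only an illustration of the exhaustive idea, not the paper's FindZero procedure; the function name, the representation of polynomials as Python callables, and the example system are assumptions.

```python
from __future__ import annotations
from itertools import product
from typing import Callable, Optional, Sequence

Poly = Callable[[Sequence[int]], int]

def find_zero_exhaustive(polys: Sequence[Poly], num_vars: int, p: int) -> Optional[tuple[int, ...]]:
    """Return some point of F_p^num_vars where every polynomial vanishes, or None."""
    # The search visits up to p ** num_vars points, which is why this only
    # works for tiny fields or when solutions are dense.
    for point in product(range(p), repeat=num_vars):
        if all(f(point) % p == 0 for f in polys):
            return point
    return None

# Example over F_7: x*y = 1 and x + y = 2 have the common zero (1, 1).
system = [lambda v: v[0] * v[1] - 1, lambda v: v[0] + v[1] - 2]
```

For a 255-bit field the same loop is hopeless in the worst case, which matches the caveat above about worst-case performance.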

Theory combination is also a promising direction for future work. The bench- 
marks we present here are in the QF_FF logic: they involve only Booleans and finite 
fields. Reasoning about different fields in combination with one another would
have natural applications to the representation of elliptic curve operations inside 
ZKPs. Reasoning about datatypes, arrays, and bit-vectors in combination with 
fields would also have natural applications to the verification of ZKP compilers. 
A Proofs of IdealCalc Properties


This appendix is available in the full version of the paper [82]. 


B Proof of Correctness for FindZero 


We prove that FindZero is correct (Theorem 3). 


Proof. It suffices to show that for each branching rule that results in ⋁_j (X_{i_j} − r_j),

\[ \mathcal{V}(\langle B \rangle) \subseteq \bigcup_j \mathcal{V}(\langle B \cup \{X_{i_j} - r_j\} \rangle). \]

First, consider an application of Univariate with univariate p(X_i). Fix z ∈
V(⟨B⟩). z is a zero of p, so for some j, r_j = z and z ∈ V(⟨B ∪ {X_i − z}⟩).

Next, consider an application of Triangular to variable X_i with minimal polynomial
p(X_i). By the definition of minimal polynomial, any zero z of ⟨B⟩ has a
value for X_i that is a root of p. Let that root be r. Then, z ∈ V(⟨B ∪ {X_i − z}⟩).

Finally, consider an application of Exhaust. The desired property is immedi- 
ate. 


C Benchmark Generation 


This appendix is available in the full version of the paper [82]. 
Abstract. String solvers are automated-reasoning tools that can solve
combinatorial problems over formal languages. They typically operate on 
restricted first-order logic formulas that include operations such as string 
concatenation, substring relationship, and regular expression matching. 
String solving thus amounts to deciding the satisfiability of such formu- 
las. While there exists a variety of different string solvers, many string 
problems cannot be solved efficiently by any of them. We present a new 
approach to string solving that encodes input problems into propositional 
logic and leverages incremental SAT solving. We evaluate our approach 
on a broad set of benchmarks. On the logical fragment that our tool 
supports, it is competitive with state-of-the-art solvers. Our experiments 
also demonstrate that an eager SAT-based approach complements exist- 
ing approaches to string solving in this specific fragment. 


1 Introduction 


Many problems in software verification require reasoning about strings. To tackle 
these problems, numerous string solvers—automated decision procedures for 
quantifier-free first-order theories of strings and string operations—have been 
developed over the last years. These solvers form the workhorse of automated- 
reasoning tools in several domains, including web-application security [19,31,33], 
software model checking [15], and conformance checking for cloud-access-control 
policies [2,30]. 

The general theory of strings relies on deep results in combinatorics 
on words [5,16,23,29]; unfortunately, the related decision procedures remain 
intractable in practice. Practical string solvers achieve scalability through a judi- 
cious mix of heuristics and restrictions on the language of constraints. 

We present a new approach to string solving that relies on an eager reduc- 
tion to the Boolean satisfiability problem (SAT), using incremental solving and 
unsatisfiable-core analysis for completeness and scalability. Our approach sup- 
ports a theory that contains Boolean combinations of regular membership con- 
straints and equality constraints on string variables, and captures a large set of 
practical queries [6]. 
Our solving method iteratively searches for satisfying assignments up to a
length bound on each string variable; it stops and reports unsatisfiability when 
the search reaches computed upper bounds without finding a solution. Similar to 
the solver WOORPJE [12], we formulate regular membership constraints as reach- 
ability problems in nondeterministic finite automata. By bounding the number 
of transitions allowed by each automaton, we obtain a finite problem that we 
encode into propositional logic. To cut down the search space of the underlying
SAT problem, we perform an alphabet reduction step (SMT-LIB string
constraints are defined over an alphabet of 3 · 2^16 letters and a naive reduction
to SAT does not scale). Inspired by bounded model checking [8], we iteratively
increase bounds and utilize an incremental SAT solver to solve the resulting 
series of propositional formulas. We perform an unsatisfiable-core analysis after 
each unsatisfiable incremental call to increase only the bounds of a minimal 
subset of variables until a theoretical upper bound is reached. 

We have evaluated our solver on a large set of benchmarks. The results show 
that our SAT-based approach is competitive with state-of-the-art SMT solvers 
in the logical fragment that we support. It is particularly effective on satisfiable 
instances. 

Closest to our work is the WOORPJE solver [12], which also employs an eager 
reduction to SAT. WOORPJE reduces systems of word equations with linear constraints
to a single Boolean formula and calls a SAT solver. An extension can
also handle regular membership constraints [21]. However, WOORPJE does not 
handle the full language of constraints considered here and does not employ the 
reduction and incremental solving techniques that make our tool scale in prac- 
tice. More importantly, in contrast to our solver, WOORPJE is not complete—it 
does not terminate on unsatisfiable instances. 

Other solvers such as Hampi [19] and Kaluza [31] encode string problems
into constraints on fixed-size bit-vectors, which can be solved by reduction to
SAT. These tools support expressive constraints but they require a user-provided
bound on the length of string variables.

Further from our work are approaches based on the lazy SMT paradigm,
which tightly integrates dedicated, heuristic theory solvers for strings using
the CDCL(T) architecture (also called DPLL(T) in early papers). Solvers that
follow this paradigm include OSTRICH [11], Z3 [25], Z3STR4 [24], cvc5 [3],
Z3STR3RE [7], TRAU [1], and CERTISTR [17]. Our evaluation shows that our
eager approach is competitive with lazy solvers overall, but it also shows that
combining both types of solvers in a portfolio is most effective. Our eager approach
tends to perform best on satisfiable instances while lazy approaches work
better on unsatisfiable problems.


2 Preliminaries 


We assume a fixed alphabet Σ and a fixed set of variables Γ. Words of Σ*
are denoted by w, w′, w″, etc. Variables are denoted by x, y, z. Our decision
procedure supports the theory described in Fig. 1.

F    := F ∨ F | F ∧ F | ¬F | Atom
Atom := x ∈ RE | x = y
RE   := RE ∪ RE | RE · RE | RE* | RE ∩ RE | ? | w


Fig. 1. Syntax: x and y denote string variables and w denotes a word of Σ*. The symbol
? is the wildcard character.


Atoms in this theory include regular membership constraints (or regular constraints
for short) of the form x ∈ RE, where RE is a regular expression, and
variable equations of the form x = y. Concatenation is not allowed in equations.

Regular expressions are defined inductively using union, concatenation, intersection,
and the Kleene star. Atomic regular expressions are constant words
w ∈ Σ* and the wildcard character ?, which is a placeholder for an arbitrary
symbol c ∈ Σ. All regular expressions are grounded, meaning that they do not
contain variables. We use the symbols ∉ and ≠ as a shorthand notation for
negations of atoms using the respective predicate symbols. The following is an
example formula in our language: ¬(x ∈ a · ?* ∧ y ∈ ?* · b) ∨ x ≠ y ∨ x ∈ a · b.

Using our basic syntax, we can define additional relations, such as constant
equations x = w, and prefix and suffix constraints, written w ⊑ x and w ⊒ x,
respectively. Even though these relations can be expressed as regular constraints
(e.g., the prefix constraint ab ⊑ x can be expressed as x ∈ a · b · ?*), we can
generate more efficient reductions to SAT by encoding them explicitly.

This string theory is not as expressive as others, since it does not include 
string concatenation, but it still has important practical applications. It is used 
in the ZELKOVA tool described by Backes, et al. [2] to support analysis of AWS 
security policies. ZELKOVA is a major industrial application of SMT solvers [30]. 

Given a formula ψ, we denote by atoms(ψ) the set of atoms occurring in ψ,
by V(ψ) the set of variables occurring in ψ, and by Σ(ψ) the set of constant
symbols occurring in ψ. We call Σ(ψ) the alphabet of ψ. Similarly, given a
regular expression R, we denote by Σ(R) the set of characters occurring in R.
In particular, we have Σ(?) = ∅.

We call a formula conjunctive if it is a conjunction of literals and we call
it a clause if it is a disjunction of literals. We say that a formula is in normal
form if it is a conjunctive formula without unnegated variable equations.
Every conjunctive formula can be turned into normal form by substitution, i.e.,
by repeatedly rewriting ψ ∧ x = y to ψ[x := y]. If ψ is in negation normal
form (NNF), meaning that the negation symbol occurs only directly in front of
atoms, we denote by lits(ψ) the set of literals occurring in ψ. We say that an
atom a occurs with positive polarity in ψ if a ∈ lits(ψ) and that it occurs with
negative polarity in ψ if ¬a ∈ lits(ψ); we denote the respective sets of atoms
of ψ by atoms⁺(ψ) and atoms⁻(ψ). The notion of polarity can be extended to
arbitrary formulas (not necessarily in NNF), intuitively by considering polarity
in a formula’s corresponding NNF (see [26] for details).
[Figure 2: flowchart of the solving pipeline with the stages Alphabet Reduction,
Bound Initialization, Propositional Encoding, and Bound Refinement; the labels
include Definitions D, Bounds b, UNSAT, and Return UNSAT.]


Fig. 2. Overview of the solving process. 


The semantics of our language is standard. A regular expression R defines a
regular language ℒ(R) over Σ in the usual way. An interpretation is a mapping
(also called a substitution) h: Γ → Σ* from string variables to words. Atoms
are interpreted as usual, and a model (also called a solution) is an interpretation
that makes a formula evaluate to true under the usual semantics of the Boolean
connectives.


3 Overview 


Our solving method is illustrated in Fig. 2. It first performs three preprocessing 
steps that generate a Boolean abstraction of the input formula, reduce the size of 
the input alphabet, and initialize bounds on the lengths of all string variables. 
After preprocessing, we enter an encode-solve-and-refine loop that iteratively 
queries a SAT solver with a problem encoding based on the current bounds 
and refines the bounds after each unsatisfiable solver call. We repeat this loop 
until either the propositional encoding is satisfiable, in which case we conclude 
satisfiability of the input formula, or each bound has reached a theoretical upper 
bound, in which case we conclude unsatisfiability. 
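The loop described above can be sketched as follows. This is a toy stand-in under several assumptions: brute-force enumeration replaces the SAT call, Python regular expressions replace the regular expressions of Fig. 1, every bound grows by one on each refinement (rather than only the bounds singled out by an unsatisfiable core), and the caller supplies `max_bound`, which is assumed to be a valid upper bound. All function names are illustrative.

```python
import re
from itertools import product

def words_up_to(alphabet, bound):
    """Enumerate all words over `alphabet` of length at most `bound`."""
    for n in range(bound + 1):
        for tup in product(alphabet, repeat=n):
            yield "".join(tup)

def solve_bounded(memberships, disequalities, variables, alphabet, bounds):
    """Stand-in for one SAT call: search for a model within the current bounds."""
    domains = [list(words_up_to(alphabet, bounds[x])) for x in variables]
    for values in product(*domains):
        h = dict(zip(variables, values))
        ok_mem = all((re.fullmatch(pat, h[x]) is not None) == positive
                     for x, pat, positive in memberships)
        ok_neq = all(h[x] != h[y] for x, y in disequalities)
        if ok_mem and ok_neq:
            return h
    return None

def solve(memberships, disequalities, variables, alphabet, initial_bound, max_bound):
    """Encode-solve-and-refine loop over length bounds."""
    bounds = {x: initial_bound for x in variables}
    while True:
        model = solve_bounded(memberships, disequalities, variables, alphabet, bounds)
        if model is not None:
            return "sat", model
        if all(bounds[x] >= max_bound for x in variables):
            return "unsat", None  # sound only if max_bound is a true upper bound
        # Refinement: grow every bound by one, capped at max_bound.
        bounds = {x: min(b + 1, max_bound) for x, b in bounds.items()}
```

For example, `solve([("x", "ab*", True), ("y", "a*", True)], [("x", "y")], ["x", "y"], "ab", 0, 3)` fails at bound 0 and succeeds after one refinement.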


Generating the Boolean Abstraction. We abstract the input formula ψ by replacing
each theory atom a ∈ atoms(ψ) with a new Boolean variable d(a), and keep
track of the mapping between a and d(a). This gives us a Boolean abstraction
ψ_A of ψ and a set D of definitions, where each definition expresses the relationship
between an atom a and its corresponding Boolean variable d(a). If a
occurs with only one polarity in ψ, we encode the corresponding definition as an
implication, i.e., as d(a) → a or as ¬d(a) → ¬a, depending on the polarity of a.
Otherwise, if a occurs with both polarities, we encode it as an equivalence consisting
of both implications. This encoding, which is based on ideas behind the
well-known Plaisted-Greenbaum transformation [28], ensures that the formulas
ψ and ψ_A ∧ ⋀_{d ∈ D} d are equisatisfiable. An example is shown in Fig. 3.
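The polarity-dependent choice between implication and equivalence can be sketched as below. The tuple-based formula representation, the generated names d0, d1, ..., and the textual output format are assumptions for illustration only.

```python
def polarities(formula, positive=True, acc=None):
    """Collect, for each atom, the set of polarities with which it occurs.

    Formulas are nested tuples: ("atom", name), ("not", f),
    ("and", f, g, ...), or ("or", f, g, ...).
    """
    acc = {} if acc is None else acc
    tag = formula[0]
    if tag == "atom":
        acc.setdefault(formula[1], set()).add(positive)
    elif tag == "not":
        polarities(formula[1], not positive, acc)
    else:  # "and" / "or" keep the polarity of their operands
        for sub in formula[1:]:
            polarities(sub, positive, acc)
    return acc

def definitions(formula):
    """Return one definition per atom, as an implication or an equivalence."""
    defs = {}
    for i, (atom, pols) in enumerate(polarities(formula).items()):
        d = f"d{i}"
        if pols == {True}:
            defs[atom] = f"{d} -> ({atom})"
        elif pols == {False}:
            defs[atom] = f"!{d} -> !({atom})"
        else:  # both polarities: equivalence
            defs[atom] = f"{d} <-> ({atom})"
    return defs

A, B, C = ("atom", "x in R1"), ("atom", "y in R2"), ("atom", "z = w")
example = ("or", ("and", A, ("not", B)), ("and", B, C))
```

In `example`, the atom `y in R2` occurs both negated and unnegated, so it is the only one that receives an equivalence.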


Reducing the Alphabet. In the SMT-LIB theory of strings [4], the alphabet Σ
comprises 3 · 2^16 letters, but we can typically use a much smaller alphabet without
affecting satisfiability. In Sect. 4, we show that using Σ(ψ) and one extra
[Figure 3: (a) Input Formula ψ; (b) Boolean Abstraction ψ_A.]

Fig. 3. Example of Boolean abstraction. The formula ψ, whose expression tree is shown
on the left, results in the Boolean abstraction illustrated on the right, where p, q, and r
are fresh Boolean variables. We additionally get the definitions p → x ∈ R1, q ↔ y ∈ R2,
and r ↔ z = w. We use an implication (instead of an equivalence) for atom x ∈ R1
since it occurs only with positive polarity within ψ.


character per string variable is sufficient. Reducing the alphabet is critical for 
our SAT encoding to be practical. 


Initializing Bounds. A model for the original first-order formula ψ is a substitution
h: Γ → Σ* that maps each string variable to a word of arbitrary length
such that ψ evaluates to true. As we use a SAT solver to find such substitutions,
we need to bound the lengths of strings, which we do by defining a bound
function b: Γ → ℕ that assigns an upper bound to each string variable. We
initialize a small upper bound for each variable, relying on simple heuristics. If
the bounds are too small, we increase them in a later refinement step.


Encoding, Solving, and Refining Bounds. Given a bound function b, we build a
propositional formula ⟦ψ⟧^b that is satisfiable if and only if the original formula ψ
has a solution h such that |h(x)| ≤ b(x) for all x ∈ Γ. We encode ⟦ψ⟧^b as the
conjunction ψ_A ∧ ⟦D⟧^b ∧ ⟦h⟧^b, where ψ_A is the Boolean abstraction of ψ, ⟦D⟧^b
is an encoding of the definitions D, and ⟦h⟧^b is an encoding of the set of possible
substitutions. We discuss details of the encoding in Sect. 5. A key property is
that it relies on incremental SAT solving under assumptions [13]. Increasing
bounds amounts to adding new clauses to the formula ⟦ψ⟧^b and fixing a set of
assumptions, i.e., temporarily fixing the truth values of a set of Boolean variables.
If ⟦ψ⟧^b is satisfiable, we can construct a substitution h from a Boolean model
ω of ⟦ψ⟧^b. Otherwise, we examine an unsatisfiable core (i.e., an unsatisfiable
subformula) of ⟦ψ⟧^b to determine whether increasing the bounds may give a
solution and, if so, to identify the variables whose bounds must be increased. In
Sect. 6, we explain in detail how we analyze unsatisfiable cores, increase bounds,
and conclude unsatisfiability.
4 Reducing the Alphabet


In many applications, the alphabet Σ is large (typically Unicode or an approximation
of Unicode as defined in the SMT-LIB standard), but formulas use far
fewer symbols (fewer than 100 symbols is common in our experiments). In order
to check the satisfiability of a formula ψ, we can restrict the alphabet to the
symbols that occur in ψ and add one extra character per variable. This allows
us to produce compact propositional encodings that can be solved efficiently in
practice.

To prove that such a reduced alphabet A is sufficient, we show that a model
h: Γ → Σ* of ψ can be transformed into a model h′: Γ → A* of ψ by replacing
characters of Σ that do not occur in ψ by new symbols, one new symbol per
variable of ψ. For example, suppose V(ψ) = {x1, x2}, Σ(ψ) = {a, c, d}, and h is
a model of ψ such that h(x1) = abcdef and h(x2) = abbd. We introduce two new
symbols α1, α2 ∈ Σ \ Σ(ψ), define h′(x1) = aα1cdα1α1 and h′(x2) = aα2α2d,
and argue that h′ is a model as well.
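The replacement used in this example can be written down directly. In the sketch below, the characters "1" and "2" stand in for the fresh symbols α1 and α2, and the function names are illustrative.

```python
from __future__ import annotations

def reduce_word(word: str, kept: set[str], fresh: str) -> str:
    """Replace every character outside `kept` by the single symbol `fresh`."""
    return "".join(c if c in kept else fresh for c in word)

def reduce_model(model: dict[str, str], kept: set[str], fresh: dict[str, str]) -> dict[str, str]:
    """Apply the per-variable replacement to every word of a model."""
    return {x: reduce_word(w, kept, fresh[x]) for x, w in model.items()}

h = {"x1": "abcdef", "x2": "abbd"}
h_reduced = reduce_model(h, kept={"a", "c", "d"}, fresh={"x1": "1", "x2": "2"})
```

Here `h_reduced` maps x1 to `a1cd11` and x2 to `a22d`, mirroring h′(x1) and h′(x2) above.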

More generally, assume B is a subset of Σ and n is a positive integer such
that |B| ≤ |Σ| − n. We can then pick n distinct symbols α1, ..., αn from Σ \ B.
Let A be the set B ∪ {α1, ..., αn}. We construct n functions f1, ..., fn from Σ
to A by setting f_i(a) = a if a ∈ B, and f_i(a) = α_i otherwise. We extend f_i to
words of Σ* in the natural way: f_i(ε) = ε and f_i(a · w) = f_i(a) · f_i(w). This
construction satisfies the following property:


Lemma 4.1. Let f1, ..., fn be mappings as defined above, and let i, j ∈ 1, ..., n
such that i ≠ j. Then, the following holds:


1. If a and b are distinct symbols of Σ, then f_i(a) ≠ f_j(b).
2. If w and w′ are distinct words of Σ*, then f_i(w) ≠ f_j(w′).


Proof. The first part is an easy case analysis. For the second part, we have
that |f_i(w)| = |w| and |f_j(w′)| = |w′|, so the statement holds if w and w′ have
different lengths. Assume now that w and w′ have the same length and let v be
the longest common prefix of w and w′. Since w and w′ are distinct, we have
that w = v · a · u and w′ = v · b · u′, where a ≠ b are symbols of Σ and u and u′
are words of Σ*. By the first part, we have f_i(a) ≠ f_j(b), so f_i(w) and f_j(w′)
must be distinct.


The following lemma can be proved by induction on R. 


Lemma 4.2. Let f1, ..., fn be mappings as defined above and let R be a regular
expression with Σ(R) ⊆ B. Then, for all words w ∈ Σ* and all i ∈ 1, ..., n,
w ∈ ℒ(R) if and only if f_i(w) ∈ ℒ(R).
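Both lemmas can be sanity-checked by brute force on a tiny instance. The sketch below assumes Σ = {a, b, c, d}, B = {a, d}, α1 = b, and α2 = c, and uses the Python pattern `a.*d` (with `.` playing the role of the wildcard ?) as a regular expression whose symbols all lie in B. It tests finitely many words only and is not a proof.

```python
import re
from itertools import product

SIGMA = "abcd"
B = {"a", "d"}
ALPHAS = {1: "b", 2: "c"}  # fresh symbols, picked from SIGMA \ B

def f(i, word):
    """The mapping f_i extended to words: keep symbols of B, map the rest to alpha_i."""
    return "".join(ch if ch in B else ALPHAS[i] for ch in word)

def words(max_len):
    for n in range(max_len + 1):
        for tup in product(SIGMA, repeat=n):
            yield "".join(tup)

def check_lemma_4_1(max_len=3):
    """Distinct words stay distinct under f_1 and f_2."""
    ws = list(words(max_len))
    return all(f(1, w) != f(2, v) for w in ws for v in ws if w != v)

def check_lemma_4_2(pattern="a.*d", max_len=4):
    """Membership in the language of `pattern` is preserved by every f_i."""
    return all(
        (re.fullmatch(pattern, w) is None) == (re.fullmatch(pattern, f(i, w)) is None)
        for w in words(max_len) for i in ALPHAS
    )
```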

Given a subset A of Σ, we say that ψ is satisfiable in A if there is a model
h: V(ψ) → A* of ψ. We can now prove the main theorem of this section, which
shows how to reduce the alphabet while maintaining satisfiability.


Theorem 4.3. Let ψ be a formula with at most n string variables x1, ..., xn
such that |Σ(ψ)| + n ≤ |Σ|. Then, ψ is satisfiable if and only if it is satisfiable
in an alphabet A ⊆ Σ of cardinality |A| = |Σ(ψ)| + n.
Proof. We set B = Σ(ψ) and use the previous construction. So the alphabet
A = B ∪ {α1, ..., αn} has cardinality |Σ(ψ)| + n, where α1, ..., αn are distinct
symbols of Σ \ B. We can assume that ψ is in disjunctive normal form, meaning
that it is a disjunction of the form ψ = ψ1 ∨ ··· ∨ ψm, where each ψ_i is a
conjunctive formula. If ψ is satisfiable, then one of the disjuncts ψ_k is satisfiable
and we have Σ(ψ_k) ⊆ B. We can turn ψ_k into normal form by eliminating all
variable equalities of the form x_i = x_j from ψ_k, resulting in a conjunction φ_k of
literals of the form x_i ∈ R, x_i ∉ R, or x_i ≠ x_j. Clearly, for any A ⊆ Σ, ψ_k is
satisfiable in A if and only if φ_k is satisfiable in A.

Let h: V(φ_k) → Σ* be a model of φ_k and define the mapping h′: V(φ_k) →
A* as h′(x_i) = f_i(h(x_i)). We show that h′ is a model of φ_k. Consider a literal l
of φ_k. We have three cases:


– l is of the form x_i ∈ R where Σ(R) ⊆ Σ(ψ) = B. Since h satisfies φ_k, we
must have h(x_i) ∈ ℒ(R), so h′(x_i) = f_i(h(x_i)) is also in ℒ(R) by Lemma 4.2.

– l is of the form x_i ∉ R with Σ(R) ⊆ B. Then, h(x_i) ∉ ℒ(R) and we can
conclude h′(x_i) ∉ ℒ(R) again by Lemma 4.2.

– l is of the form x_i ≠ x_j. Since h satisfies φ_k, we must have i ≠ j and h(x_i) ≠
h(x_j), which implies h′(x_i) = f_i(h(x_i)) ≠ f_j(h(x_j)) = h′(x_j) by Lemma 4.1.


All literals of φ_k are then satisfied by h′, hence φ_k is satisfiable in A and thus
so is ψ_k. It follows that ψ is satisfiable in A.


The reduction presented here can be improved and generalized. For example, it 
can be worthwhile to use different alphabets for different variables or to reduce 
large character intervals to smaller sets. 


5 Propositional Encodings 


Our algorithm performs a series of calls to a SAT solver. Each call determines the
satisfiability of the propositional encoding ⟦ψ⟧^b of ψ for some upper bounds b.
Recall that ⟦ψ⟧^b = ψ_A ∧ ⟦h⟧^b ∧ ⟦D⟧^b, where ψ_A is the Boolean abstraction of ψ,
⟦h⟧^b is an encoding of the set of possible substitutions, and ⟦D⟧^b is an encoding
of the theory-literal definitions, both bounded by b. Intuitively, ⟦h⟧^b tells the
SAT solver to “guess” a substitution, ⟦D⟧^b makes sure that all theory literals
are assigned proper truth values according to the substitution, and ψ_A forces
the evaluation of the whole formula under these truth values.

Suppose the algorithm performs n calls and let b_k: Γ → ℕ for k ∈ 1, ..., n
denote the upper bounds used in the k-th call to the SAT solver. For convenience,
we additionally define b_0(x) = 0 for all x ∈ Γ. In the k-th call, the SAT
solver decides whether ⟦ψ⟧^{b_k} is satisfiable. The Boolean abstraction ψ_A, which
we already discussed in Sect. 3, stays the same for each call. In the following,
we thus discuss the encodings of the substitutions ⟦h⟧^{b_k} and of the various theory
literals ⟦a⟧^{b_k} and ⟦¬a⟧^{b_k} that are part of ⟦D⟧^{b_k}. Even though SAT solvers
expect their input in CNF, we do not present the encodings in CNF to simplify
the presentation, but they can be converted to CNF using simple equivalence
transformations.

Most of our encodings are incremental in the sense that the formula for call
k is constructed by only adding clauses to the formula for call k − 1. In other
words, for substitution encodings we have ⟦h⟧^{b_k} = ⟦h⟧^{b_{k−1}} ∧ ⟦h⟧^{b_k}_{b_{k−1}} and for
literals we have ⟦l⟧^{b_k} = ⟦l⟧^{b_{k−1}} ∧ ⟦l⟧^{b_k}_{b_{k−1}}, with the base case ⟦h⟧^{b_0} = ⟦l⟧^{b_0} = ⊤.
In these cases, it is thus enough to encode the incremental additions ⟦h⟧^{b_k}_{b_{k−1}}
and ⟦l⟧^{b_k}_{b_{k−1}} for each call to the SAT solver. Some of our encodings, however,
introduce clauses that are valid only for a specific bound b_k and thus become
invalid for larger bounds. We handle the deactivation of these encodings with
selector variables as is common in incremental SAT solving.

Our encodings are correct in the following sense.¹


Theorem 5.1. Let l be a literal and let b: Γ → ℕ be a bound function. Then,
l has a model that is bounded by b if and only if ⟦h⟧^b ∧ ⟦l⟧^b is satisfiable.


5.1 Substitutions 


We encode substitutions by defining for each variable x ∈ Γ the characters to
which each of x’s positions is mapped. Specifically, given x and its corresponding
upper bound b(x), we represent the substitution h(x) by introducing new variables
x[1], ..., x[b(x)], one for each symbol h(x)[i] of the word h(x). We call these
variables filler variables and we denote the set of all filler variables by Γ̌. By
introducing a new symbol λ ∉ Σ, which stands for an unused filler variable, we
can define h based on a substitution ȟ: Γ̌ → Σ_λ over the filler variables, where
Σ_λ = Σ ∪ {λ}:

\[ h(x)[i] = \begin{cases} \varepsilon & \text{if } \check{h}(x[i]) = \lambda \\ \check{h}(x[i]) & \text{otherwise} \end{cases} \]

We use this representation of substitutions (known as “filling the positions” [18])
because it has a straightforward propositional encoding: For each variable x ∈ Γ
and each position i ∈ 1, ..., b(x), we create a set {h^a_{x[i]} | a ∈ Σ_λ} of Boolean
variables, where h^a_{x[i]} is true if ȟ(x[i]) = a. We then use a propositional encoding
of an exactly-one (EO) constraint (e.g., [20]) to assert that exactly one variable
in this set must be true:

\[ \llbracket h \rrbracket^{b_k}_{b_{k-1}} = \bigwedge_{x \in \Gamma} \; \bigwedge_{i = b_{k-1}(x)+1}^{b_k(x)} \mathrm{EO}\bigl(\{ h^{a}_{x[i]} \mid a \in \Sigma_\lambda \}\bigr) \tag{1} \]

\[ \land \bigwedge_{x \in \Gamma} \; \bigwedge_{i = b_{k-1}(x)}^{b_k(x)-1} \bigl( h^{\lambda}_{x[i]} \rightarrow h^{\lambda}_{x[i+1]} \bigr) \tag{2} \]


1 Proof is omitted due to space constraints but made available for review purposes. 
Constraint (2) prevents the SAT solver from considering filled substitutions that
are equivalent modulo $\lambda$-substitutions: it enforces that if a position $i$ is mapped
to $\lambda$, all following positions are mapped to $\lambda$ too. For instance, $ab\lambda\lambda$, $a\lambda b\lambda$,
and $\lambda\lambda ab$ all correspond to the same word $ab$, but our encoding allows only
$ab\lambda\lambda$. Thus, every Boolean assignment $\omega$ that satisfies $\llbracket h \rrbracket^b$ encodes exactly one
substitution $h_\omega$, and for every substitution $h$ (bounded by $b$) there exists a
corresponding assignment $\omega_h$ that satisfies $\llbracket h \rrbracket^b$.
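
As an illustration of this encoding (not the authors' implementation), the following Python sketch builds the exactly-one clauses and the $\lambda$-suffix clauses for a single variable and enumerates all models by brute force. It encodes the full bound in one step instead of incrementally, and it uses a plain pairwise exactly-one encoding, which is an assumption made for brevity rather than the encoding of [20]. The names `encode_substitution`, `models`, and `decode` are hypothetical.

```python
from itertools import combinations, product

LAMBDA = "λ"  # stands for an unused filler position


def encode_substitution(var, bound, alphabet):
    """Return (variables, clauses) for one string variable.

    A Boolean variable is a triple (var, i, a), read as
    "position i of var is mapped to symbol a".
    A clause is a list of (variable, polarity) pairs.
    """
    symbols = list(alphabet) + [LAMBDA]
    variables = [(var, i, a) for i in range(1, bound + 1) for a in symbols]
    clauses = []
    for i in range(1, bound + 1):
        # Exactly-one (pairwise encoding): at least one symbol per position ...
        clauses.append([((var, i, a), True) for a in symbols])
        # ... and no two symbols at the same position.
        for a, c in combinations(symbols, 2):
            clauses.append([((var, i, a), False), ((var, i, c), False)])
    for i in range(1, bound):
        # Constraint (2): λ at position i implies λ at position i + 1.
        clauses.append([((var, i, LAMBDA), False), ((var, i + 1, LAMBDA), True)])
    return variables, clauses


def models(variables, clauses):
    """Enumerate all satisfying assignments by brute force."""
    for bits in product([False, True], repeat=len(variables)):
        omega = dict(zip(variables, bits))
        if all(any(omega[v] == pol for v, pol in clause) for clause in clauses):
            yield omega


def decode(omega, var, bound):
    """Read the word h(var) off an assignment, dropping λ positions."""
    word = ""
    for i in range(1, bound + 1):
        for (v, j, a), value in omega.items():
            if value and v == var and j == i and a != LAMBDA:
                word += a
    return word


variables, clauses = encode_substitution("x", 2, "ab")
words = sorted(decode(m, "x", 2) for m in models(variables, clauses))
print(words)
```

For the alphabet {a, b} and bound 2, each of the seven words of length at most 2 is produced by exactly one model, which is the one-to-one correspondence stated above.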


5.2 Theory Literals 


The only theory literals of our core language are regular constraints ($x \in R$) and
variable equations ($x = y$) with their negations. Constant equations ($x = w$) as
well as prefix and suffix constraints ($w \sqsubseteq x$ and $w \sqsupseteq x$) could be expressed as
regular constraints, but we encode them explicitly to improve performance.


Regular Constraints. We encode a regular constraint $x \in R$ by constructing
a propositional formula that is true if and only if the word $h(x)$ is accepted by a
specific nondeterministic finite automaton that accepts the language $\mathcal{L}(R)$. Let
$x \in R$ be a regular constraint and let $M = (Q, \Sigma, \delta, q_0, F)$ be a nondeterministic
finite automaton (with states $Q$, alphabet $\Sigma$, transition relation $\delta$, initial state
$q_0$, and accepting states $F$) that accepts $\mathcal{L}(R)$ and that additionally allows
$\lambda$-self-transitions on every state. Given that $\lambda$ is a placeholder for the empty symbol,
$\lambda$-transitions do not change the language accepted by $M$. We allow them so
that $M$ performs exactly $b(x)$ transitions, even for substitutions of length less
than $b(x)$. This reduces checking whether the automaton accepts a word to only
evaluating the states reached after exactly $b(x)$ transitions.

Given a model $\omega \models \llbracket h \rrbracket^b$, we express the semantics of $M$ in propositional logic
by encoding which states are reachable after reading $h_\omega(x)$. To this end, we assign
$b(x) + 1$ Boolean variables $\{S^0_q, S^1_q, \ldots, S^{b(x)}_q\}$ to each state $q \in Q$ and assert
that $\omega(S^i_q) = 1$ if and only if $q$ can be reached by reading prefix $h_\omega(x)[1..i]$. We
encode this as a conjunction $\llbracket (M;x) \rrbracket = \llbracket I_{(M;x)} \rrbracket \land \llbracket T_{(M;x)} \rrbracket \land \llbracket P_{(M;x)} \rrbracket$ of three
formulas, modelling the semantics of the initial state, the transition relation,
and the predecessor relation of $M$. We assert that the initial state $q_0$ is the
only state reachable after reading the prefix of length 0, i.e.,
$\llbracket I_{(M;x)} \rrbracket^{b_1}_{b_0} = S^0_{q_0} \land \bigwedge_{q \in Q \setminus \{q_0\}} \neg S^0_q$.
The condition is independent of the bound on $x$, thus we set
$\llbracket I_{(M;x)} \rrbracket^{b_k}_{b_{k-1}} = \top$ for all $k > 1$.

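We encode the transition relation of $M$ by stating that if $M$ is in some state
$q$ after reading $h_\omega(x)[1..i]$, and if there exists a transition from $q$ to $q'$ labelled
with $a$, then $M$ can reach state $q'$ after $i + 1$ transitions if $h_\omega(x)[i+1] = a$.
This is expressed in the following formula:

\[
\llbracket T_{(M;x)} \rrbracket^{b_k}_{b_{k-1}} = \bigwedge_{i=b_{k-1}(x)}^{b_k(x)-1} \; \bigwedge_{(q,a) \in \mathrm{dom}(\delta)} \; \bigwedge_{q' \in \delta(q,a)} \left( (S^i_q \land h^a_{x[i+1]}) \to S^{i+1}_{q'} \right)
\]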
The formula captures all possible forward moves from each state. We must also
ensure that a state is reachable only if it has a reachable predecessor, which we
encode with the following formula, where $\mathrm{pred}(q') = \{(q, a) \mid q' \in \delta(q, a)\}$:

\[
\llbracket P_{(M;x)} \rrbracket^{b_k}_{b_{k-1}} = \bigwedge_{i=b_{k-1}(x)+1}^{b_k(x)} \; \bigwedge_{q' \in Q} \left( S^i_{q'} \to \bigvee_{(q,a) \in \mathrm{pred}(q')} (S^{i-1}_q \land h^a_{x[i]}) \right)
\]

The formula states that if state $q'$ is reachable after $i \geq 1$ transitions, then
there must be a reachable predecessor state $q \in \hat{\delta}(\{q_0\}, h_\omega(x)[1..i-1])$ such that
$q' \in \delta(q, h_\omega(x)[i])$.

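To decide whether the automaton accepts $h_\omega(x)$, we encode that it must
reach an accepting state after $b_k(x)$ transitions. Our corresponding encoding
is only valid for the particular bound $b_k(x)$. To account for this, we introduce
a fresh selector variable $s_k$ and define
$\llbracket \mathrm{accept}_{(M;x)} \rrbracket^{b_k}_{b_{k-1}} = s_k \to \bigvee_{q_f \in F} S^{b_k(x)}_{q_f}$.
Analogously, we define
$\llbracket \mathrm{reject}_{(M;x)} \rrbracket^{b_k}_{b_{k-1}} = s_k \to \bigwedge_{q_f \in F} \neg S^{b_k(x)}_{q_f}$.
In the $k$-th call to the SAT solver and all following calls with the same bound on $x$, we solve
under the assumption that $s_k$ is true. In the first call $k'$ with $b_k(x) < b_{k'}(x)$,
we re-encode the condition using a new selector variable $s_{k'}$ and solve under
the assumption that $s_k$ is false and $s_{k'}$ is true. The full encoding of the regular
constraint $x \in R$ is thus given by

\[
\llbracket x \in R \rrbracket^{b_k}_{b_{k-1}} = \llbracket (M;x) \rrbracket^{b_k}_{b_{k-1}} \land \llbracket \mathrm{accept}_{(M;x)} \rrbracket^{b_k}_{b_{k-1}},
\]

and its negation $x \notin R$ is encoded as

\[
\llbracket x \notin R \rrbracket^{b_k}_{b_{k-1}} = \llbracket (M;x) \rrbracket^{b_k}_{b_{k-1}} \land \llbracket \mathrm{reject}_{(M;x)} \rrbracket^{b_k}_{b_{k-1}}.
\]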
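
The intended meaning of the state variables can be sketched in Python as follows. Rather than producing clauses, the sketch computes, for a padded word, the sets of states that the initial-state, transition, and predecessor formulas together force to be true at each step, and then checks acceptance after exactly $b(x)$ transitions. The function names and the dictionary representation of $\delta$ are assumptions made for this illustration.

```python
LAMBDA = "λ"  # placeholder symbol for unused positions


def reachable_sets(delta, q0, filling):
    """Return S, where S[i] is the set of states reachable after i transitions.

    delta maps (state, symbol) to a set of successor states. Every state is
    assumed to carry an implicit λ-self-transition.
    """
    S = [{q0}]  # only the initial state is reachable after 0 transitions
    for symbol in filling:
        if symbol == LAMBDA:
            S.append(set(S[-1]))  # λ-self-transition keeps the current states
        else:
            S.append({q2 for q in S[-1] for q2 in delta.get((q, symbol), ())})
    return S


def accepts(delta, q0, final, word, bound):
    """Check acceptance by looking only at the states reached after `bound` steps."""
    if len(word) > bound:
        raise ValueError("word is longer than the bound")
    filling = list(word) + [LAMBDA] * (bound - len(word))
    return bool(reachable_sets(delta, q0, filling)[bound] & final)


# NFA for the regular expression ab*: 0 --a--> 1, 1 --b--> 1, accepting state 1.
delta = {(0, "a"): {1}, (1, "b"): {1}}
print(accepts(delta, 0, {1}, "ab", 4))  # the padded word is a b λ λ
print(accepts(delta, 0, {1}, "ba", 4))
```

Because of the $\lambda$-self-transitions, a word shorter than the bound ends in the same state set it reached after its last real symbol, so inspecting step $b(x)$ alone is enough.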


Variable Equations. Let $x, y \in \Gamma$ be two string variables, let
$l = \min(b_{k-1}(x), b_{k-1}(y))$, and let $u = \min(b_k(x), b_k(y))$. We encode equality
between $x$ and $y$ with respect to $b_k$ position-wise up to $u$:

\[
\llbracket x = y \rrbracket^{b_k}_{b_{k-1}} = \bigwedge_{i=l+1}^{u} \; \bigwedge_{a \in \Sigma_\lambda} \left( h^a_{x[i]} \to h^a_{y[i]} \right)
\]

The formula asserts that for each position $i \in l+1, \ldots, u$, if $x[i]$ is mapped to a
symbol, then $y[i]$ is mapped to the same symbol (including $\lambda$). Since our encoding
of substitutions ensures that every position in a string variable is mapped to
exactly one character, $\llbracket x = y \rrbracket^{b_k}_{b_{k-1}}$ ensures $x[i] = y[i]$ for $i \in l+1, \ldots, u$. In
conjunction with $\llbracket x = y \rrbracket^{b_{k-1}}$, which encodes equality up to the $l$-th position, we
have symbol-wise equality of $x$ and $y$ up to bound $u$. Thus, if $b_k(x) = b_k(y)$,
then the formula ensures the equality of both variables. If $b_k(x) > b_k(y)$, we add
$h^\lambda_{x[u+1]}$ as an assumption to the solver to ensure $x[i] = \lambda$ for $i \in u+1, \ldots, b_k(x)$
and, symmetrically, we add the assumption $h^\lambda_{y[u+1]}$ if $b_k(y) > b_k(x)$.
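
A minimal Python sketch of this incremental step is given below. It lists the implications added when the bounds grow and the assumption needed when the two bounds differ. The function name, the triple representation of Boolean variables, and the dictionary arguments are hypothetical choices for this illustration.

```python
LAMBDA = "λ"  # placeholder symbol for unused positions


def var_eq_increment(prev, cur, alphabet):
    """Incremental part of the encoding of x = y.

    prev and cur map "x" and "y" to their previous and current bounds.
    Returns (implications, assumptions): each implication is a pair
    (premise, conclusion) of Boolean variables (var, position, symbol).
    """
    l = min(prev["x"], prev["y"])  # equality is already encoded up to l
    u = min(cur["x"], cur["y"])    # new positions to cover: l + 1 .. u
    symbols = list(alphabet) + [LAMBDA]
    implications = [
        (("x", i, a), ("y", i, a))
        for i in range(l + 1, u + 1)
        for a in symbols
    ]
    assumptions = []
    if cur["x"] > cur["y"]:
        assumptions.append(("x", u + 1, LAMBDA))  # force the tail of x to be empty
    elif cur["y"] > cur["x"]:
        assumptions.append(("y", u + 1, LAMBDA))  # force the tail of y to be empty
    return implications, assumptions


imps, assumed = var_eq_increment({"x": 1, "y": 1}, {"x": 3, "y": 2}, "ab")
print(imps)
print(assumed)
```

In this call only position 2 is new, so three implications are produced (one per symbol of $\Sigma_\lambda$), and the single assumption pins $x[3]$ to $\lambda$; constraint (2) then propagates $\lambda$ to any later positions of $x$.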
For the negation $x \neq y$, we encode that $h(x)$ and $h(y)$ must disagree on at least
one position, which can happen either because they map to different symbols
or because the variable with the higher bound is mapped to a longer word. As
for the regular constraints, we again use a selector variable $s_k$ to deactivate the
encoding for all later bounds, for which it will be re-encoded:

\[
\llbracket x \neq y \rrbracket^{b_k}_{b_{k-1}} =
\begin{cases}
s_k \to \bigvee_{i=1}^{u} \bigvee_{a \in \Sigma_\lambda} (\neg h^a_{x[i]} \land h^a_{y[i]}) & \text{if } b_k(x) = b_k(y) \\
s_k \to \bigvee_{i=1}^{u} \bigvee_{a \in \Sigma_\lambda} (\neg h^a_{x[i]} \land h^a_{y[i]}) \lor \neg h^\lambda_{y[u+1]} & \text{if } b_k(x) < b_k(y) \\
s_k \to \bigvee_{i=1}^{u} \bigvee_{a \in \Sigma_\lambda} (\neg h^a_{x[i]} \land h^a_{y[i]}) \lor \neg h^\lambda_{x[u+1]} & \text{if } b_k(x) > b_k(y)
\end{cases}
\]


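Constant Equations. Given a constant equation $x = w$, if the upper bound
of $x$ is less than $|w|$, the atom is trivially unsatisfiable. Thus, for all $i$ such that
$b_i(x) < |w|$, we encode $x = w$ with a simple literal $\neg s_{x,w}$ and add $s_{x,w}$ to the
assumptions. For $b_k(x) \geq |w|$, the encoding is based on the value of $b_{k-1}(x)$:

\[
\llbracket x = w \rrbracket^{b_k}_{b_{k-1}} =
\begin{cases}
\bigwedge_{i=1}^{|w|} h^{w[i]}_{x[i]} & \text{if } b_{k-1}(x) < |w| = b_k(x) \\
\bigwedge_{i=1}^{|w|} h^{w[i]}_{x[i]} \land h^\lambda_{x[|w|+1]} & \text{if } b_{k-1}(x) < |w| < b_k(x) \\
h^\lambda_{x[|w|+1]} & \text{if } b_{k-1}(x) = |w| < b_k(x) \\
\top & \text{if } |w| < b_{k-1}(x)
\end{cases}
\]

If $b_{k-1}(x) < |w|$, then equality is encoded for all positions $1, \ldots, |w|$. Additionally,
if $b_k(x) > |w|$, we ensure that the suffix of $x$ is empty starting from position
$|w| + 1$. If $b_{k-1}(x) = |w| < b_k(x)$, then only the empty suffix has to be ensured.
Lastly, if $|w| < b_{k-1}(x)$, then $\llbracket x = w \rrbracket^{b_{k-1}} \Leftrightarrow \llbracket x = w \rrbracket^{b_k}$.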
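
The case distinction can be made concrete with a short Python sketch that returns the unit literals added in one incremental step. The function name and the triple representation of the Boolean variables are hypothetical. The sketch also covers the situation in which the bound on $x$ does not change, which is not discussed in the text and is assumed here to add nothing.

```python
LAMBDA = "λ"  # placeholder symbol for unused positions


def const_eq_increment(prev_bound, cur_bound, w):
    """Literals added for x = w when the bound on x grows from prev_bound to cur_bound.

    Each literal (x, i, a) asserts that position i of x is mapped to a.
    """
    n = len(w)
    if cur_bound < n:
        # Below |w| the atom is unsatisfiable and is handled by a selector instead.
        raise ValueError("bound below |w|")
    match = [("x", i, w[i - 1]) for i in range(1, n + 1)]
    empty_suffix = [("x", n + 1, LAMBDA)]
    if prev_bound < n:
        # Positions 1..|w| are new; an empty suffix is needed only if there is room.
        return match + (empty_suffix if cur_bound > n else [])
    if prev_bound == n:
        # Matching was encoded earlier; only the empty suffix may be missing.
        # If the bound did not change, nothing is added (an assumption of this sketch).
        return empty_suffix if cur_bound > n else []
    return []  # |w| < previous bound: everything is already encoded


print(const_eq_increment(0, 2, "ab"))
print(const_eq_increment(1, 3, "ab"))
print(const_eq_increment(2, 4, "ab"))
print(const_eq_increment(3, 5, "ab"))
```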

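Conversely, for an inequality $x \neq w$, if $b_k(x) < |w|$, then any substitution
trivially is a solution, which we simply encode with $\top$. Otherwise, we introduce
a selector variable $s'_{x,w}$ and define

\[
\llbracket x \neq w \rrbracket^{b_k}_{b_{k-1}} =
\begin{cases}
s'_{x,w} \to \bigvee_{i=1}^{|w|} \neg h^{w[i]}_{x[i]} & \text{if } b_{k-1}(x) < |w| = b_k(x) \\
\neg s'_{x,w} \to \bigvee_{i=1}^{|w|} \neg h^{w[i]}_{x[i]} \lor \neg h^\lambda_{x[|w|+1]} & \text{if } b_{k-1}(x) \leq |w| < b_k(x) \\
\top & \text{if } |w| < b_{k-1}(x) \leq b_k(x)
\end{cases}
\]

If $b_k(x) = |w|$, then a substitution $h$ satisfies the constraint if and only if
$h(x)[i] \neq w[i]$ for some $i \in 1, \ldots, |w|$. If $b_k(x) > |w|$, in addition, $h$ satisfies the
constraint if $|h(x)| > |w|$. Thus, if $b_k(x) = |w|$, we perform solver call $k$ under
the assumption $s'_{x,w}$, and if $b_k(x) > |w|$, we perform it under the assumption
$\neg s'_{x,w}$. Again, if $|w| < b_{k-1}(x)$, then $\llbracket x \neq w \rrbracket^{b_{k-1}} \Leftrightarrow \llbracket x \neq w \rrbracket^{b_k}$.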


Prefix and Suffix Constraints. A prefix constraint $w \sqsubseteq x$ expresses that
the first $|w|$ positions of $x$ must be mapped exactly onto $w$. As with equations
between a variable $x$ and a constant word $w$, we could express this as a regular
constraint of the form $x \in w \cdot \Sigma^*$. However, we achieve a more efficient encoding
simply by dropping from the encoding of $\llbracket x = w \rrbracket$ the assertion that the suffix
of $x$ starting at $|w| + 1$ be empty. Accordingly, a negated prefix constraint $w \not\sqsubseteq x$
expresses that there is an index $i \in 1, \ldots, |w|$ such that the $i$-th position of $x$
is mapped onto a symbol different from $w[i]$, which we encode by repurposing
$\llbracket x \neq w \rrbracket$ in a similar manner. Suffix constraints $w \sqsupseteq x$ and $w \not\sqsupseteq x$ can be encoded
by analogous modifications to the encodings of $x = w$ and $x \neq w$.


6 Refining Upper Bounds 


Our procedure solves a series of SAT problems where the length bounds on string
variables increase after each unsatisfiable solver call. The procedure terminates
once the bounds are large enough so that further increasing them would be futile.
To determine when this is the case, we rely on the upper bounds of a shortest
solution to a formula $\psi$. We call a model $h$ of $\psi$ a shortest solution of $\psi$ if $\psi$ has
no model $h'$ such that $\sum_{x \in \Gamma} |h'(x)| < \sum_{x \in \Gamma} |h(x)|$. We first establish this bound
for conjunctive formulas in normal form, where all literals are of the form $x \neq y$,
$x \in R$, or $x \notin R$. Once established, we show how the bound can be generalized
to arbitrary formulas.

Let $\varphi$ be a formula in normal form and let $x_1, \ldots, x_n$ be the variables of $\varphi$.
For each variable $x_i$, we can collect all the regular constraints on $x_i$, that is, all
the literals of the form $x_i \in R$ or $x_i \notin R$ that occur in $\varphi$. We can characterize
the solutions to all these constraints by a single nondeterministic finite automaton
$M_i$. If the constraints on $x_i$ are $x_i \in R_1, \ldots, x_i \in R_k, x_i \notin R'_1, \ldots, x_i \notin R'_l$,
then $M_i$ is an NFA that accepts the regular language
$\bigcap_{t=1}^{k} \mathcal{L}(R_t) \cap \bigcap_{t=1}^{l} \overline{\mathcal{L}(R'_t)}$,
where $\overline{\mathcal{L}(R)}$ denotes the complement of $\mathcal{L}(R)$. We say that $M_i$ accepts the regular
constraints on $x_i$ in $\varphi$. If there are no such constraints on $x_i$, then $M_i$ is the
one-state NFA that accepts the full language $\Sigma^*$. Let $Q_i$ denote the set of states
of $M_i$. If we do not take inequalities into account and if the regular constraints
on $x_i$ are satisfiable, then a shortest solution $h$ has length $|h(x_i)| \leq |Q_i|$.

Theorem 6.1 gives a bound for the general case with variable inequalities.
Intuitively, we prove the theorem by constructing a single automaton $\mathcal{P}$ that
takes as input a vector of words $W = (w_1, \ldots, w_n)^T$ and accepts $W$ iff the substitution
$h_W$ with $h_W(x_i) = w_i$ satisfies $\varphi$. To construct $\mathcal{P}$, we introduce one
two-state NFA for each inequality and we then form the product of these NFAs
with (slightly modified versions of) the NFAs $M_1, \ldots, M_n$. We can then derive
the bound of a shortest solution from the number of states of $\mathcal{P}$.


Theorem 6.1. Let $\varphi$ be a conjunctive formula in normal form over variables
$x_1, \ldots, x_n$. Let $M_i = (Q_i, \Sigma, \delta_i, q_{0,i}, F_i)$ be an NFA that accepts the regular constraints
on $x_i$ in $\varphi$ and let $k$ be the number of inequalities occurring in $\varphi$. If $\varphi$
is satisfiable, then it has a model $h$ such that

\[
|h(x_i)| \leq 2^k \times |Q_1| \times \cdots \times |Q_n|.
\]
Proof. Let $\lambda$ be a symbol that does not belong to $\Sigma$ and define $\Sigma_\lambda = \Sigma \cup \{\lambda\}$. As
previously, we use $\lambda$ to extend words of $\Sigma^*$ by padding. Given a word $w \in \Sigma_\lambda^*$, we
denote by $\hat{w}$ the word of $\Sigma^*$ obtained by removing all occurrences of $\lambda$ from $w$.
We say that $w$ is well-formed if it can be written as $w = v \cdot \lambda^t$ with $v \in \Sigma^*$ and
$t \geq 0$. In this case, we have $\hat{w} = v$. Thus a well-formed word $w$ consists of a
prefix in $\Sigma^*$ followed by a sequence of $\lambda$s.

Let $\Delta$ be the alphabet $\Sigma_\lambda^n$, i.e., the letters of $\Delta$ are the $n$-letter words over
$\Sigma_\lambda$. We can then represent a letter $u$ of $\Delta$ as an $n$-element vector $(u_1, \ldots, u_n)$,
and a word $W$ of $\Delta^t$ can be written as an $n \times t$ matrix

\[
W = \begin{pmatrix} u_{11} & \cdots & u_{t1} \\ \vdots & & \vdots \\ u_{1n} & \cdots & u_{tn} \end{pmatrix}
\]

where $u_{ij} \in \Sigma_\lambda$. Each column of this matrix is a letter in $\Delta$ and each row is a word
in $\Sigma_\lambda^t$. We denote by $p_i(W)$ the $i$-th row of this matrix and by $\hat{p}_i(W) = \widehat{p_i(W)}$
the word $p_i(W)$ with all occurrences of $\lambda$ removed. We say that $W$ is well-formed
if the words $p_1(W), \ldots, p_n(W)$ are all well-formed. Given a well-formed word $W$,
we can construct a mapping $h_W\colon \{x_1, \ldots, x_n\} \to \Sigma^*$ by setting $h_W(x_i) = \hat{p}_i(W)$
and we have $|h_W(x_i)| \leq |W| = t$.

To prove the theorem, we build an NFA $\mathcal{P}$ with alphabet $\Delta$ such that a well-formed
word $W$ is accepted by $\mathcal{P}$ iff $h_W$ satisfies $\varphi$. The shortest well-formed
$W$ accepted by $\mathcal{P}$ has length no more than the number of states of $\mathcal{P}$ and the
bound will follow.

We first extend the NFA $M_i = (Q_i, \Sigma, \delta_i, q_{0,i}, F_i)$ to an automaton $M'_i$ with
alphabet $\Delta$. $M'_i$ has the same set of states, initial state, and final states as $M_i$.
Its transition relation $\delta'_i$ is defined by

\[
\delta'_i(q, u) = \begin{cases} \delta_i(q, u_i) & \text{if } u_i \in \Sigma \\ \{q\} & \text{if } u_i = \lambda. \end{cases}
\]

One can easily check that $M'_i$ accepts a word $W$ iff $M_i$ accepts $\hat{p}_i(W)$.
For an inequality $x_i \neq x_j$, we construct an NFA $D_{i,j} = (\{e, d\}, \Delta, \delta, e, \{d\})$
with transition function defined as follows:

\[
\begin{aligned}
\delta(e, u) &= \{e\} && \text{if } u_i = u_j \\
\delta(e, u) &= \{d\} && \text{if } u_i \neq u_j \\
\delta(d, u) &= \{d\}.
\end{aligned}
\]

This NFA has two states. It starts in state $e$ (for "equal") and stays in $e$ as long
as the characters $u_i$ and $u_j$ are equal. It transitions to state $d$ (for "different")
on the first $u$ where $u_i \neq u_j$ and stays in state $d$ from that point. Since $d$ is the
final state, a word $W$ is accepted by $D_{i,j}$ iff $p_i(W) \neq p_j(W)$. If $W$ is well-formed,
we also have that $W$ is accepted by $D_{i,j}$ iff $\hat{p}_i(W) \neq \hat{p}_j(W)$.

Let $x_{i_1} \neq x_{j_1}, \ldots, x_{i_k} \neq x_{j_k}$ denote the $k$ inequalities of $\varphi$. We define $\mathcal{P}$ to be
the product of the NFAs $M'_1, \ldots, M'_n$ and $D_{i_1,j_1}, \ldots, D_{i_k,j_k}$. A well-formed word
$W$ is accepted by $\mathcal{P}$ if it is accepted by all $M'_i$ and all $D_{i_t,j_t}$, which means that
$\mathcal{P}$ accepts a well-formed word $W$ iff $h_W$ satisfies $\varphi$.

Let $P$ be the set of states of $\mathcal{P}$. We then have $|P| \leq 2^k \times |Q_1| \times \cdots \times |Q_n|$.
Assume $\varphi$ is satisfiable, so $\mathcal{P}$ accepts a well-formed word $W$. The shortest well-formed
word accepted by $\mathcal{P}$ has an accepting run that does not visit the same
state twice. So the length of this well-formed word $W$ is no more than $|P|$. The
mapping $h_W$ satisfies $\varphi$ and for every $x_i$, it satisfies
$|h_W(x_i)| = |\hat{p}_i(W)| \leq |W| \leq |P| \leq 2^k \times |Q_1| \times \cdots \times |Q_n|$.


The bound given by Theorem 6.1 holds if $\varphi$ is in normal form but it also holds
for a general conjunctive formula $\psi$. This follows from the observation that
converting conjunctive formulas to normal form preserves the length of solutions.
In particular, we convert $\psi \land x = y$ to the formula $\psi' = \psi[x := y]$ so $x$ does not occur
in $\psi'$, but clearly, a bound for $y$ in $\psi'$ gives us the same bound for $x$ in $\psi$.

In practice, before we apply the theorem we decompose the conjunctive formula
$\varphi$ into subformulas that have disjoint sets of variables. We write $\varphi$ as
$\varphi_1 \land \ldots \land \varphi_m$ where the conjuncts have no common variables. Then, $\varphi$ is satisfiable
if each conjunct $\varphi_t$ is satisfiable and we derive upper bounds on the shortest
solution for the variables of $\varphi_t$, which gives more precise bounds than deriving
bounds from $\varphi$ directly. In particular, if a variable $x_i$ of $\varphi$ does not occur in any
inequality, then the bound on $|h(x_i)|$ is $|Q_i|$.
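
The effect of the decomposition can be sketched in Python as follows. The sketch assumes that, in normal form, two variables can only share a conjunct through an inequality, so the variable-disjoint subformulas correspond to the connected components of the inequality graph; this reading, the function name, and the input format are assumptions made for the illustration. Each component receives the bound of Theorem 6.1 restricted to its own variables and inequalities.

```python
from math import prod


def shortest_solution_bounds(num_states, inequalities):
    """Upper bound on |h(x)| for every variable x.

    num_states maps each variable to |Q_i|, the number of states of its NFA.
    inequalities is a list of pairs (x, y), one per literal x != y.
    """
    parent = {v: v for v in num_states}

    def find(v):
        # Union-find with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in inequalities:
        parent[find(x)] = find(y)  # x and y end up in the same component

    bounds = {}
    for v in num_states:
        root = find(v)
        members = [u for u in num_states if find(u) == root]
        k = sum(1 for x, _ in inequalities if find(x) == root)
        # Theorem 6.1 applied to the component of v only.
        bounds[v] = 2 ** k * prod(num_states[u] for u in members)
    return bounds


print(shortest_solution_bounds({"x": 3, "y": 2, "z": 5}, [("x", "y")]))
```

In this example the variable z occurs in no inequality and keeps the bound $|Q_z| = 5$, whereas applying the theorem to the whole formula at once would give $2 \times 3 \times 2 \times 5 = 60$ for every variable.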

Theorem 6.1 only holds for conjunctive formulas. For an arbitrary (non-conjunctive)
formula $\psi$, a generalization is to convert $\psi$ into disjunctive normal
form. Alternatively, it is sufficient to enumerate the subsets of $\mathrm{lits}(\psi)$. Given a
subset $A$ of $\mathrm{lits}(\psi)$, let us denote by $d_A$ a mapping that bounds the length of
shortest solutions to $A$, i.e., if $A$ is satisfiable, then it has a solution $h$ with
$|h(x)| \leq d_A(x)$. This mapping $d_A$ can be computed from Theorem 6.1. The
following property gives a bound for $\psi$.


Proposition 6.2. If $\psi$ is satisfiable, then it has a model $h$ such that for all
$x \in \Gamma$, it holds that $|h(x)| \leq \max\{d_A(x) \mid A \subseteq \mathrm{lits}(\psi)\}$.


Proof. We can assume that $\psi$ is in negation normal form. We can then convert $\psi$
to disjunctive normal form $\psi \Leftrightarrow \psi_1 \lor \cdots \lor \psi_n$ and we have $\mathrm{lits}(\psi_i) \subseteq \mathrm{lits}(\psi)$. Also,
$\psi$ is satisfiable if and only if at least one $\psi_i$ is satisfiable and the proposition
follows.


Since there are $2^{|\mathrm{lits}(\psi)|}$ subsets of $\mathrm{lits}(\psi)$, a direct application of Proposition 6.2
is rarely feasible in practice. Fortunately, we can use unsatisfiable cores to reduce
the number of subsets to consider.


6.1 Unsatisfiable-Core Analysis 


Instead of calculating the bounds upfront, we use the unsatisfiable core produced
by the SAT solver after each incremental call to evaluate whether the upper
bounds on the variables exceed the upper bounds of the shortest solution. If
$\llbracket \psi \rrbracket^b$ is unsatisfiable for bounds $b$, then it has an unsatisfiable core

\[
C = C_A \land C_h \land \bigwedge_{a \in \mathrm{atoms}^+(\psi)} C_a \land \bigwedge_{a \in \mathrm{atoms}^-(\psi)} C_{\bar{a}}
\]

with (possibly empty) subsets of clauses $C_A \subseteq \psi_A$, $C_h \subseteq \llbracket h \rrbracket^b$, $C_a \subseteq (d(a) \to \llbracket a \rrbracket^b)$,
and $C_{\bar{a}} \subseteq (\neg d(a) \to \llbracket \neg a \rrbracket^b)$. Here we implicitly assume $\psi_A$, $d(a) \to \llbracket a \rrbracket^b$,
and $\neg d(a) \to \llbracket \neg a \rrbracket^b$ to be in CNF. Let $C^+ = \{a \mid C_a \neq \emptyset\}$ and
$C^- = \{\neg a \mid C_{\bar{a}} \neq \emptyset\}$ be the sets of literals whose encodings contain at least one clause of the core
$C$. Using these sets, we construct the formula


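\[
\psi^C = \psi_A \land \bigwedge_{a \in C^+} (d(a) \to a) \land \bigwedge_{\neg a \in C^-} (\neg d(a) \to \neg a),
\]

which consists of the conjunction of the abstraction and the definitions of the
literals that are contained in $C^+$ and $C^-$, respectively. Recall that $\psi$ is equisatisfiable
to the conjunction $\psi_A \land \bigwedge_{d \in D} d$ of the abstraction and all definitions in $D$. Let
$\psi'$ denote this formula, i.e.,

\[
\psi' = \psi_A \land \bigwedge_{a \in \mathrm{atoms}^+(\psi)} (d(a) \to a) \land \bigwedge_{a \in \mathrm{atoms}^-(\psi)} (\neg d(a) \to \neg a).
\]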


The following proposition shows that it suffices to refine the bounds according
to $\psi^C$.


Proposition 6.3. Let $\psi$ be unsatisfiable with respect to $b$ and let $C$ be an unsatisfiable
core of $\llbracket \psi \rrbracket^b$. Then, $\psi^C$ is unsatisfiable with respect to $b$ and $\psi' \models \psi^C$.


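Proof. By definition, we have
$\llbracket \psi^C \rrbracket^b = \psi_A \land \llbracket h \rrbracket^b \land \bigwedge_{a \in C^+} (d(a) \to \llbracket a \rrbracket^b) \land \bigwedge_{\neg a \in C^-} (\neg d(a) \to \llbracket \neg a \rrbracket^b)$.
This implies $C \subseteq \llbracket \psi^C \rrbracket^b$ and, since $C$ is an unsatisfiable
core, $\llbracket \psi^C \rrbracket^b$ is unsatisfiable. That is, $\psi^C$ is unsatisfiable with respect
to $b$. We also have $\psi' \models \psi^C$ since $C^+ \subseteq \mathrm{atoms}^+(\psi)$ and $C^- \subseteq \mathrm{atoms}^-(\psi)$.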


Applying Proposition 6.2 to $\psi^C$ results in the upper bounds of the shortest
solution $h_C$ for $\psi^C$. If $|h_C(x)| \leq b(x)$ holds for all $x \in \Gamma$, then $\psi^C$ has no solution
and unsatisfiability of $\psi'$ follows from Proposition 6.3. Because $\psi$ and $\psi'$ are
equisatisfiable, we can conclude that $\psi$ is unsatisfiable.

Otherwise, we increase the bounds on the variables that occur in $\psi^C$ while
keeping bounds on the other variables unchanged: we construct $b_{k+1}$ with
$b_k(x) \leq b_{k+1}(x) \leq |h_C(x)|$ for all $x \in \Gamma$, such that $b_k(y) < b_{k+1}(y)$ holds for
at least one $y \in V(\psi^C)$. By strictly increasing at least one variable's bound, we
eventually either reach the upper bounds of $\psi^C$ and return unsatisfiability, or we
eliminate it as an unsatisfiable implication of $\psi$. As there are only finitely many
possibilities for $C$ and thus for $\psi^C$, our procedure is guaranteed to terminate.
We do not explicitly construct the formula $\psi^C$ to compute bounds on $h_C$ as we
know the set $\mathrm{lits}(\psi^C) = C^+ \cup C^-$. Finding upper bounds still requires enumerating
all subsets of $\mathrm{lits}(\psi^C)$, but we have $|\mathrm{lits}(\psi^C)| \leq |\mathrm{lits}(\psi)|$ and usually $\mathrm{lits}(\psi^C)$
is much smaller than $\mathrm{lits}(\psi)$. For example, consider the formula

\[
\psi = z \neq abd \land (x = a \lor x \in ab^*) \land x = y \land (y = bbc \lor z \in a(b|c)^*d) \land y \in ab \cdot \Sigma^*
\]

which is unsatisfiable for the bounds $b(x) = b(y) = 1$ and $b(z) = 4$.
The unsatisfiable core $C$ returned after solving $\llbracket \psi \rrbracket^b$ results in the formula
$\psi^C = (x = a \lor x \in ab^*) \land x = y \land y \in ab \cdot \Sigma^*$ containing four literals. Finding
upper bounds for $\psi^C$ thus amounts to enumerating just $2^4$ subsets, which is substantially
fewer than considering all $2^7$ subsets of $\mathrm{lits}(\psi)$ upfront. The conjunction
of a subset of $\mathrm{lits}(\psi^C)$ yielding the largest upper bounds is
$x \in ab^* \land x = y \land y \in ab \cdot \Sigma^*$, which simplifies to $x \in ab^* \cap ab \cdot \Sigma^*$ and has a solution of length at most 2
for $x$ and $y$. With bounds $b(x) = b(y) = 2$ and $b(z) = 4$, the formula is satisfiable.
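
The two claims about this example can be checked with a small brute-force search in Python, independent of any SAT encoding. The sketch assumes the alphabet {a, b, c, d}, which the example does not state explicitly, and reads the constant in the second disjunction as bbc; that constant does not influence the outcome. The function names are hypothetical.

```python
import re
from itertools import product

SIGMA = "abcd"  # assumed alphabet for this example


def words_up_to(n):
    """All words over SIGMA of length at most n, shortest first."""
    for length in range(n + 1):
        for letters in product(SIGMA, repeat=length):
            yield "".join(letters)


def holds(x, y, z):
    """Evaluate the example formula on concrete words."""
    return (
        z != "abd"
        and (x == "a" or re.fullmatch(r"ab*", x) is not None)
        and x == y
        and (y == "bbc" or re.fullmatch(r"a[bc]*d", z) is not None)
        and re.fullmatch(r"ab[abcd]*", y) is not None
    )


def find_model(bx, by, bz):
    """Return the first model within the given length bounds, or None."""
    for x in words_up_to(bx):
        for y in words_up_to(by):
            if x != y:
                continue  # x = y is a top-level conjunct, so prune early
            for z in words_up_to(bz):
                if holds(x, y, z):
                    return x, y, z
    return None


print(find_model(1, 1, 4))
print(find_model(2, 2, 4))
```

With bounds 1, 1, 4 no model exists because $y \in ab \cdot \Sigma^*$ needs at least two symbols; with bounds 2, 2, 4 the search finds x = y = ab and z = ad.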


7 Implementation 


We have implemented our approach in a solver called NFA2SAT. NFA2SAT is
written in Rust and uses CaDiCaL [9] as the backend SAT solver. We use the
incremental API provided by CaDiCaL to solve problems under assumptions.
Soundness of NFA2SAT follows from Theorem 5.1. For completeness, we rely
on CaDiCaL's failed function to efficiently determine failed assumptions, i.e.,
assumption literals that were used to conclude unsatisfiability.

The procedure works as follows. Given a formula $\psi$, we first introduce one
fresh Boolean selector variable $s_l$ for each theory literal $l \in \mathrm{lits}(\psi)$. Then, instead
of adding the encoded definitions of the theory literals directly to the SAT
solver, we precede them with their corresponding selector variables: for a positive
literal $a$, we add $s_a \to (d(a) \to \llbracket a \rrbracket)$, and for a negative literal $\neg a$, we
add $s_{\neg a} \to (\neg d(a) \to \llbracket \neg a \rrbracket)$ (considering assumptions introduced by $\llbracket a \rrbracket$ as unit
clauses). In the resulting CNF formula, the new selector variables are present
in all clauses that encode their corresponding definition, and we use them as
assumptions for every incremental call to the SAT solver, which does not affect
satisfiability. If such an assumption failed, then we know that at least one of the
corresponding clauses in the propositional formula was part of an unsatisfiable
core, which enables us to efficiently construct the sets $C^+$ and $C^-$ of positive and
negative atoms present in the unsatisfiable core. As noted previously, we have
$\mathrm{lits}(\psi^C) = C^+ \cup C^-$ and hence the sets are sufficient to find bounds on a shortest
model for $\psi^C$.

This approach is efficient for obtaining $\mathrm{lits}(\psi^C)$ but since CaDiCaL does not
guarantee that the set of failed assumptions is minimal, $\mathrm{lits}(\psi^C)$ is not minimal
in general. Moreover, even a minimal $\mathrm{lits}(\psi^C)$ can contain too many elements
for processing all subsets. To address this issue, we enumerate the subsets only
if $\mathrm{lits}(\psi^C)$ is small (by default, we use a limit of ten literals). In this case, we
construct the automata $M_i$ used in Theorem 6.1 for each subset, using the
techniques described in [7] for quickly ruling out unsatisfiable ones. Otherwise,
instead of enumerating the subsets, we resort to sound approximations of upper
bounds, which amounts to over-approximating the number of states without
explicitly constructing the automata (cf. [14]).

Once we have obtained upper bounds on the length of the solution of $\psi^C$, we
increment bounds on all variables involved, except those that have reached their
maximum. Our default heuristic computes a new bound that is either double
the current bound of a variable or its maximum, whichever is smaller.
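
The bound update can be sketched in Python as follows. This is an illustration under stated assumptions, not the code of NFA2SAT: the function name and the dictionary arguments are hypothetical, and the sketch raises a bound of 0 to 1 so that doubling always makes progress, a detail the text does not specify.

```python
def refine_bounds(bounds, core_limits):
    """Compute the next bound function after an unsatisfiable solver call.

    bounds maps every variable to its current bound.
    core_limits maps each variable of the core formula to the upper bound
    on a shortest solution derived from the unsatisfiable core.
    Returns None if every core variable has reached its limit, in which case
    the input formula can be reported as unsatisfiable.
    """
    if all(bounds[x] >= limit for x, limit in core_limits.items()):
        return None
    new_bounds = dict(bounds)  # variables outside the core keep their bounds
    for x, limit in core_limits.items():
        if bounds[x] < limit:
            # Double the bound, capped at the limit. max(1, ...) is an added
            # assumption so that a bound of 0 still increases.
            new_bounds[x] = min(max(1, 2 * bounds[x]), limit)
    return new_bounds


print(refine_bounds({"x": 1, "y": 1, "z": 4}, {"x": 2, "y": 2}))
print(refine_bounds({"x": 2, "y": 2, "z": 4}, {"x": 2, "y": 2}))
```

The first call increases the bounds of x and y to 2 and leaves z untouched; the second call returns None because no core variable can grow any further.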


8 Experimental Evaluation 


We have evaluated our solver on a large set of benchmarks from the ZaligVinder
[22] repository². The repository contains 120,287 benchmarks stemming
from both academic and industrial applications. In particular, all the string problems
from the SMT-LIB repository³ are included in the ZaligVinder repository.
We converted the ZaligVinder problems to the SMT-LIB 2.6 syntax and
removed duplicates. This resulted in 82,632 unique problems out of which 29,599
are in the logical fragment we support.

We compare NFA2SAT with the state-of-the-art solvers cvc5 (version 1.0.3)
and Z3 (version 4.12.0). The comparison is limited to these two solvers because
they are widely adopted and because they had the best performance in our evaluation.
Other string solvers either do not support our logical fragment (CertiStr,
Woorpje) or gave incorrect answers on the benchmark problems considered
here. Older, no-longer maintained, solvers have known soundness problems, as
reported in [7] and [27].

We ran our experiments on a Linux server, with a timeout of 1200 seconds of
CPU time and a memory limit of 16 GB. Table 1 shows the results. As a single
tool, NFA2SAT solves more problems than cvc5 but not as many as Z3. All three
tools solve more than 98% of the problems.

The table also shows results of portfolios that combine two solvers. In a portfolio
configuration, the best setting is to use both Z3 and NFA2SAT. This combination
solves all but 20 problems within the timeout. It also reduces the total
run-time from 283,942s for Z3 (about 79h) to 28,914s for the portfolio (about
8h), that is, a 90% reduction in total solve time. The other two portfolios
(Z3 with cvc5, and NFA2SAT with cvc5) also have better performance
than a single solver, but the improvement in runtime and number of timeouts is
not as large.
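
For reference, the rounded figures quoted for the Z3 and NFA2SAT portfolio follow from the total runtimes by simple arithmetic:

```latex
\[
\frac{283{,}942\ \mathrm{s}}{3600\ \mathrm{s/h}} \approx 78.9\ \mathrm{h},
\qquad
\frac{28{,}914\ \mathrm{s}}{3600\ \mathrm{s/h}} \approx 8.0\ \mathrm{h},
\qquad
1 - \frac{28{,}914}{283{,}942} \approx 1 - 0.102 = 0.898 \approx 90\%.
\]
```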

Figure 4a illustrates why NFA2SAT and Z3 complement each other well. The 
figure shows three scatter plots that compare the runtime of NFA2SAT and Z3 on 
our problems. The plot on the left compares the two solvers on all problems, the 
one in the middle compares them on satisfiable problems, and the one on the right 
compares them on unsatisfiable problems. Points in the left plot are concentrated 
close to the axes, with a smaller number of points near the diagonal, meaning 
that Z3 and NFA2SAT have different runtime on most problems. The other two 


² https://github.com/zaligvinder/zaligvinder.
³ https://clc-gitlab.cs.uiowa.edu:2443/SMT-LIB-benchmarks/QF-S.
Table 1. Evaluation on ZaligVinder benchmarks. The three left columns show results
of individual solvers. The other three columns show results of portfolios combining two
solvers.

                     cvc5      Z3  NFA2SAT  cvc5+Z3  NFA2SAT+cvc5  NFA2SAT+Z3
SAT                 22895   22927    22922    22934         22934       22934
UNSAT                6259    6486     6405     6526          6598        6645
Timeout               445     185      206      139            67          20
Out-of-memory           0       1       66      n/a           n/a         n/a
Total Solved        29154   29413    29327    29460         29532       29579
Total Runtime (s)  655877  283942   275420   169553        126655       28914
Fig. 4. Comparison of runtime (in seconds) with Z3 and cvc5. The left plots include
all problems, the middle plots include only satisfiable problems, and the right plots
include only unsatisfiable problems. The lines marked "failed" correspond to problems
that are not solved because a solver ran out of memory. The lines marked "timeout"
correspond to problems not solved because of a timeout (1200s).


plots show this even more clearly: NFA2SAT is faster on satisfiable problems while
Z3 is faster on unsatisfiable problems. Figure 4b shows analogous scatter plots
comparing NFA2SAT and cvc5. The two solvers show similar performance on
a large set of easy benchmarks although cvc5 is faster on problems that both
solvers can solve in less than 1s. However, cvc5 times out on 38 problems that
NFA2SAT solves in less than 2s. On unsatisfiable problems, cvc5 tends to be
faster than NFA2SAT, but there is a class of problems for which NFA2SAT takes
between 10 and 100s whereas cvc5 is slower.

Overall, the comparison shows that NFA2SAT is competitive with cvc5 and
Z3 on these benchmarks. We also observe that NFA2SAT tends to work better on
satisfiable problems. For best overall performance, our experiments show that a
portfolio of Z3 and NFA2SAT would solve all but 20 problems within the timeout,
and reduce the total solve time by 90%.


9 Conclusion 


We have presented the first eager SAT-based approach to string solving that is
both sound and complete for a reasonably expressive fragment of string theory.
Our experimental evaluation shows that our approach is competitive with the
state-of-the-art lazy SMT solvers Z3 and cvc5, outperforming them on satisfiable
problems but falling behind on unsatisfiable ones. A portfolio that combines
our approach with these solvers (particularly with Z3) would thus yield strong
performance across both types of problems.

In future work, we plan to extend our approach to a more expressive logical
fragment, including more general word equations. Other avenues of research
include the adaptation of model checking techniques such as IC3 [10] to string
problems, which we hope would lead to better performance on unsatisfiable
instances. A particular benefit of the eager approach is that it enables the use
of mature techniques from the SAT world, especially for proof generation and
parallel solving. Producing proofs of unsatisfiability is complex for traditional
CDCL(T) solvers because of the complex rewriting and deduction rules they
employ. In contrast, efficiently generating and checking proofs produced by SAT
solvers (using the DRAT format [32]) is well-established and practicable. A challenge
in this respect would be to combine unsatisfiability proofs from a SAT
solver with a proof that our reduction to SAT is sound. For parallel solving, we
plan to explore the use of a parallel incremental solver (such as iLingeling [9])
as well as other possible ways to solve multiple bounds in parallel.


References 


1. Abdulla, P.A., et al.: Trau: SMT solver for string constraints. In: 2018 Formal 
Methods in Computer Aided Design (FMCAD), pp. 1-5 (2018). https://doi.org/ 
10.23919/FMCAD.2018.8602997 

2. Backes, J., et al.: Semantic-based automated reasoning for AWS access policies 
using SMT. In: 2018 Formal Methods in Computer Aided Design (FMCAD), pp. 
1-9 (2018). https: //doi.org/10.23919/FMCAD.2018.8602994 


206 


10. 


11. 


12. 


13. 


14. 


15. 


K. Lotz et al. 


Barbosa, H., et al.: cvc5: A versatile and industrial-strength SMT solver. In: Fis- 
man, D., Rosu, G. (eds.) Tools and Algorithms for the Construction and Analysis 
of Systems - 28th International Conference, TACAS 2022, Held as Part of the 
European Joint Conferences on Theory and Practice of Software, ETAPS 2022, 
Munich, Germany, April 2-7, 2022, Proceedings, Part I. Lecture Notes in Com- 
puter Science, vol. 13243, pp. 415-442. Springer (2022). https://doi.org/10.1007/ 
978-3-030-99524-9_24 


. Barrett, C., Fontaine, P., Tinelli, C.: The SMT-LIB Standard: Version 2.6. Tech. 


rep., Department of Computer Science, The University of Iowa (2017). www.smt- 
lib.org 

Berzish, M., et al.: String theories involving regular membership predicates: From 
practice to theory and back. In: Lecroq, T., Puzynina, S. (eds.) Combinatorics on 
Words, pp. 50-64. Springer International Publishing, Cham (2021) 

Berzish, M., et al.: Towards more efficient methods for solving regular-expression 
heavy string constraints. Theoretical Computer Science 943, 50-72 (2023). https:// 
doi.org/10.1016/j.tcs.2022.12.009, https: //www.sciencedirect.com/science/article/ 
pii/S030439752200723X 

Berzish, M., et al.: An SMT solver for regular expressions and linear arithmetic 
over string length. In: Silva, A., Leino, K.R.M. (eds.) Computer Aided Verification, 
pp. 289-312. Springer International Publishing, Cham (2021) 

Biere, A.: Bounded model checking. In: Biere, A., Heule, M., van Maaren, H., 
Walsh, T. (eds.) Handbook of Satisfiability, Frontiers in Artificial Intelligence and 
Applications, vol. 185, pp. 457—481. IOS Press (2009). https://doi.org/10.3233/ 
978-1-58603-929-5-457 

Biere, A., Fazekas, K., Fleury, M., Heisinger, M.: CaDiCaL, Kissat, Paracooba, 
Plingeling and Treengeling entering the SAT Competition 2020. In: Balyo, T., 
Froleyks, N., Heule, M., Iser, M., Jarvisalo, M., Suda, M. (eds.) Proc. of SAT 
Competition 2020 - Solver and Benchmark Descriptions. Department of Computer 
Science Report Series B, vol. B-2020-1, pp. 51-53. University of Helsinki (2020) 
Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R., 
Schmidt, D. (eds.) VMCAI 2011. LNCS, vol. 6538, pp. 70-87. Springer, Heidel- 
berg (2011). https: //doi.org/10.1007/978-3-642-18275-4_7 

Chen, T., Hague, M., Lin, A.W., Riimmer, P., Wu, Z.: Decision procedures for 
path feasibility of string-manipulating programs with complex operations. Proc. 
ACM Program. Lang. 3(POPL) (jan 2019). https://doi.org/10.1145/3290362 
Day, J.D., Ehlers, T., Kulczynski, M., Manea, F., Nowotka, D., Poulsen, D.B.: 
On solving word equations using SAT. In: Filiot, E., Jungers, R., Potapov, I. 
(eds.) Reachability Problems, pp. 93-106. Springer International Publishing, Cham 
(2019) 

Eén, N., Sörensson, N.: Temporal induction by incremental SAT solving. Electronic 
Notes in Theoretical Computer Science 89(4), 543-560 (2003). https://doi.org/10. 
1016/S1571-0661(05)82542-3, https://www.sciencedirect.com/science/article/pii/ 
$1571066105825423, bMC’2003, First International Workshop on Bounded Model 
Checking 

Gao, Y., Moreira, N., Reis, R., Yu, S.: A survey on operational state complexity. 
CoRR abs/1509.03254 (2015), http://arxiv.org/abs/1509.03254 

Hojjat, H., Rümmer, P., Shamakhi, A.: On strings in software model checking.
In: Lin, A.W. (ed.) Programming Languages and Systems, pp. 19-30. Springer 
International Publishing, Cham (2019) 


Jeż, A.: Word Equations in Nondeterministic Linear Space. In: Chatzigiannakis,
I., Indyk, P., Kuhn, F., Muscholl, A. (eds.) 44th International Colloquium on
Automata, Languages, and Programming (ICALP 2017). Leibniz International
Proceedings in Informatics (LIPIcs), vol. 80, pp. 95:1-95:13. Schloss Dagstuhl-
Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2017).
https://doi.org/10.4230/LIPIcs.ICALP.2017.95,
http://drops.dagstuhl.de/opus/volltexte/2017/7408

Kan, S., Lin, A.W., Rümmer, P., Schrader, M.: Certistr: A certified string solver.
In: Proceedings of the 11th ACM SIGPLAN International Conference on Certi- 
fied Programs and Proofs, pp. 210-224. CPP 2022, Association for Computing 
Machinery, New York, NY, USA (2022). https://doi.org/10.1145/3497775.3503691

Karhumäki, J., Mignosi, F., Plandowski, W.: The expressibility of languages and
relations by word equations. J. ACM 47(3), 483-505 (may 2000).
https://doi.org/10.1145/337244.337255

Kiezun, A., Ganesh, V., Guo, P.J., Hooimeijer, P., Ernst, M.D.: Hampi: A solver for 
string constraints. In: Proceedings of the Eighteenth International Symposium on 
Software Testing and Analysis, pp. 105-116. ISSTA ’09, Association for Computing 
Machinery, New York, NY, USA (2009). https://doi.org/10.1145/1572272.1572286

Klieber, W., Kwon, G.: Efficient CNF encoding for selecting 1 from N objects. In:
Fourth Workshop on Constraints in Formal Verification (CFV) (2007)

Kulczynski, M., Lotz, K., Nowotka, D., Poulsen, D.B.: Solving string theories
involving regular membership predicates using SAT. In: Legunsen, O., Rosu, G. 
(eds.) Model Checking Software, pp. 134-151. Springer International Publishing, 
Cham (2022) 

Kulczynski, M., Manea, F., Nowotka, D., Poulsen, D.B.: Zaligvinder: A 
generic test framework for string solvers. J. Softw.: Evolution and Process 
n/a(n/a), e2400. https://doi.org/10.1002/smr.2400,
https://onlinelibrary.wiley.com/doi/abs/10.1002/smr.2400

Makanin, G.S.: The problem of solvability of equations in a free semigroup.
Math. USSR, Sb. 32, 129-198 (1977).
https://doi.org/10.1070/SM1977v032n02ABEH002376

Mora, F., Berzish, M., Kulczynski, M., Nowotka, D., Ganesh, V.: Z3str4: A multi-armed
string solver. In: Huisman, M., Păsăreanu, C., Zhan, N. (eds.) Formal Methods,
pp. 389-406. Springer International Publishing, Cham (2021)

de Moura, L., Bjørner, N.: Z3: An efficient SMT solver. In: Proceedings of 
the Theory and Practice of Software, 14th International Conference on Tools 
and Algorithms for the Construction and Analysis of Systems, pp. 337-340. 
TACAS’08/ETAPS’08, Springer-Verlag, Berlin, Heidelberg (2008) 

Murray, N.V.: Completely non-clausal theorem proving. Artificial Intelligence 
18(1), 67-85 (1982). https://doi.org/10.1016/0004-3702(82)90011-X,
https://www.sciencedirect.com/science/article/pii/000437028290011X

Nötzli, A., Reynolds, A., Barbosa, H., Barrett, C.W., Tinelli, C.: Even faster
conflicts and lazier reductions for string solvers. In: Shoham, S., Vizel, Y. (eds.)
Computer Aided Verification - 34th International Conference, CAV 2022, Haifa,
Israel, August 7-10, 2022, Proceedings, Part II. Lecture Notes in Computer Science,
vol. 13372, pp. 205-226. Springer (2022).
https://doi.org/10.1007/978-3-031-13188-2_11

Plaisted, D.A., Greenbaum, S.: A structure-preserving clause form translation. 
Journal of Symbolic Computation 2(3), 293-304 (1986).
https://doi.org/10.1016/S0747-7171(86)80028-1,
https://www.sciencedirect.com/science/article/pii/S0747717186800281


Plandowski, W.: Satisfiability of word equations with constants is in PSPACE. 
In: 40th Annual Symposium on Foundations of Computer Science (Cat. 
No.99CB37039), pp. 495-500 (1999). https://doi.org/10.1109/SFFCS.1999.814622

Rungta, N.: A billion SMT queries a day (invited paper). In: Shoham, S., Vizel, Y.
(eds.) Computer Aided Verification. pp. 3-18. Springer International Publishing, 
Cham (2022). https://doi.org/10.1007/978-3-031-13185-1_1 

Saxena, P., Akhawe, D., Hanna, S., Mao, F., McCamant, S., Song, D.: A symbolic 
execution framework for JavaScript. In: 2010 IEEE Symposium on Security and 
Privacy, pp. 513-528 (2010). https://doi.org/10.1109/SP.2010.38 

Wetzler, N., Heule, M., Hunt Jr., W.A.: DRAT-trim: Efficient checking and trimming
using expressive clausal proofs. In: Sinz, C., Egly, U. (eds.) Theory and Applications
of Satisfiability Testing - SAT 2014 - 17th International Conference, Held
as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 14-17,
2014. Proceedings. Lecture Notes in Computer Science, vol. 8561, pp. 422-429.
Springer (2014). https://doi.org/10.1007/978-3-319-09284-3_31

Yu, F., Alkhalaf, M., Bultan, T., Ibarra, O.H.: Automata-based symbolic string 
analysis for vulnerability detection. Formal Methods Syst. Design 44(1), 44-70
(2014). https://doi.org/10.1007/s10703-013-0189-1


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 


The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.


The GOLEM Horn Solver 


Martin Blicha¹,² (✉), Konstantin Britikov¹, and Natasha Sharygina¹


1 Università della Svizzera italiana, Lugano, Switzerland
{blichm,britik,sharygin}@usi.ch
2 Charles University, Prague, Czech Republic

[CAV Artifact Evaluation badges: Available, Reusable]


Abstract. The logical framework of Constrained Horn Clauses (CHC) 
models verification tasks from a variety of domains, ranging from verifi- 
cation of safety properties in transition systems to modular verification 
of programs with procedures. In this work we present GOLEM, a flexible 
and efficient solver for satisfiability of CHC over linear real and integer 
arithmetic. GOLEM provides flexibility with modular architecture and 
multiple back-end model-checking algorithms, as well as efficiency with 
tight integration with the underlying SMT solver. This paper describes 
the architecture of GOLEM and its back-end engines, which include our 
recently introduced model-checking algorithm TPA for deep exploration. 
The description is complemented by extensive evaluation, demonstrating 
the competitive nature of the solver. 


Keywords: Constrained Horn Clauses · Model Checking


1 Introduction 


The framework of Constrained Horn Clauses (CHC) has been proposed as a uni- 
fied, purely logic-based, intermediate format for software verification tasks [33]. 
CHC provides a powerful way to model various verification problems, such as 
safety, termination, and loop invariant computation, across different domains like 
transition systems, functional programs, procedural programs, concurrent sys- 
tems, and more [33-35,41]. The key advantage of CHC is the separation of mod- 
elling from solving, which aligns with the important software design principle— 
separation of concerns. This makes CHCs highly reusable, allowing a specialized 
CHC solver to be used for different verification tasks across domains and pro- 
gramming languages. The main focus of the front end is then to translate the 
source code into the language of constraints, while the back end can focus solely 
on the well-defined formal problem of deciding satisfiability of a CHC system. 
CHC-based verification is becoming increasingly popular, with several frame- 
works developed in recent years, including SEAHORN, KORN and TRICERA for 
C [27,28,36], JAYHORN for Java [44], RUSTHORN for Rust [48], HORNDROID for
Android [18], SolCMC and SmartACE for Solidity [2,57]. A novel CHC-based 
approach for testing also shows promising results [58]. The growing demand from 
verifiers drives the development of specialized Horn solvers. Different solvers 
implement different techniques based on, e.g., model-checking approaches (such 
as predicate abstraction [32], CEGAR [22] and IC3/PDR [16,26]), machine learning,
automata, or CHC transformations. ELDARICA [40] uses predicate abstraction
and CEGAR as the core solving algorithm. It leverages Craig interpolation [23]
not only to guide the predicate abstraction but also for acceleration [39].
Additionally, it controls the form of the interpolants with interpolation
abstraction [46,53]. SPACER [45] is the default algorithm for solving CHCs in
Z3 [51]. It extends the PDR-style algorithm for nonlinear CHC [38] with
under-approximations and leverages model-based projection for predecessor
computation. Recently it was enriched with global guidance [37]. ULTIMATE
TREEAUTOMIZER [25] implements automata-based approaches to CHC solving [43,56].
HOICE [20] implements a machine-learning-based technique adapted from the
ICE framework developed for discovering inductive invariants of transition
systems [19]. FREQHORN [29,30] combines syntax-guided synthesis [4] with data
derived from unrollings of the CHC system.

According to the results of the international competition on CHC solving 
CHC-COMP [24,31,54], solvers applying model-checking techniques, namely 
SPACER and ELDARICA, are regularly outperforming the competitors. These 
are the solvers most often used as the back ends in CHC-based verification 
projects. However, only specific algorithms have been explored in these tools 
for CHC solving, limiting their application for diverse verification tasks. Experi- 
ence from software verification and model checking of transition systems shows 
that in contrast to the state of affairs in CHC solving, it is possible to build a 
flexible infrastructure with a unified environment for multiple back-end solving 
algorithms. CPACHECKER [6-11] and Pono [47] are examples of such tools.

This work aims to bring this flexibility to the general domain-independent 
framework of constrained Horn clauses. We present GOLEM, a new solver 
for CHC satisfiability, that provides a unique combination of flexibility and 
efficiency.¹ GOLEM implements several SMT-based model-checking algorithms:
our recent model-checking algorithm based on Transition Power Abstraction 
(TPA) [13,14], and state-of-the-art model-checking algorithms Bounded Model 
Checking (BMC) [12], k-induction [55], Interpolation-based Model Checking 
(IMC) [49], Lazy Abstractions with Interpolants (LAWI) [50] and SPACER [45]. 
GOLEM achieves efficiency through tight integration with the underlying interpo- 
lating SMT solver OPENSMT [17,42] and preprocessing transformations based 
on predicate elimination, clause merging and redundant clause elimination. The 
flexible and modular framework of OPENSMT enables customization for differ- 
ent algorithms; its powerful interpolation modules, particularly, offer fine con- 
trol (in size and strength) with multiple interpolant generation procedures. We 
report experimentation that confirms the advantage of multiple diverse solving 
techniques and shows that GOLEM is competitive with state-of-the-art Horn 
solvers on large sets of problems.² Overall, GOLEM can serve as an efficient back


1 GOLEM is available at https://github.com/usi-verification-and-security/golem.

2 This is in line with results from CHC-COMP 2021 and 2022 [24,31]. In 2022, GOLEM
beat other solvers except Z3-SPACER in the LRA-TS, LIA-Lin and LIA-Nonlin 
tracks. 
end for domain-specific verification tools and as a research tool for prototyping
and evaluating SMT- and interpolation-based verification techniques in a unified 
setting. 


2 Tool Overview 


In this section, we describe the main components and features of the tool together 
with the details of its usage. For completeness, we recall the terminology related 
to CHCs first. 


Constrained Horn Clauses. A constrained Horn clause is a formula
$\varphi \land B_1 \land B_2 \land \ldots \land B_n \implies H$, where $\varphi$ is the
constraint, a formula in the background theory, $B_1, \ldots, B_n$ are uninterpreted
predicates, and $H$ is an uninterpreted predicate or
false. The antecedent of the implication is commonly denoted as the body and 
the consequent as the head. A clause with more than one predicate in the body is 
called nonlinear. A nonlinear system of CHCs has at least one nonlinear clause; 
otherwise, the system is linear. 
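The terminology above can be made concrete with a small Python sketch. The `Clause` record, its field names, and the toy predicates `Inv`, `P` and `Q` are illustrative assumptions and do not reflect GOLEM's internal representation; constraints are kept as opaque text because only the number of body predicates matters for the linear/nonlinear distinction.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Clause:
    """One CHC: constraint /\\ body predicates => head (None encodes false)."""
    constraint: str            # formula of the background theory, kept as text
    body: Tuple[str, ...]      # names of the uninterpreted predicates in the body
    head: Optional[str]        # name of the head predicate, or None for false


def is_nonlinear_clause(clause: Clause) -> bool:
    # A clause is nonlinear when its body has more than one predicate.
    return len(clause.body) > 1


def is_linear_system(clauses: Sequence[Clause]) -> bool:
    # A system is linear when none of its clauses is nonlinear.
    return not any(is_nonlinear_clause(c) for c in clauses)


# Hypothetical toy system, used only to exercise the definitions.
SYSTEM = [
    Clause("x = 0", (), "Inv"),             # fact: no predicate in the body
    Clause("y = x + 1", ("Inv",), "Inv"),   # linear clause: one body predicate
    Clause("true", ("P", "Q"), "Inv"),      # nonlinear clause: two body predicates
    Clause("x < 0", ("Inv",), None),        # query: head is false
]

print(is_linear_system(SYSTEM))       # the third clause makes the system nonlinear
print(is_linear_system(SYSTEM[:2]))
```

Dropping the third clause from `SYSTEM` leaves a linear system, which is the fragment handled by most of the engines described later.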


Fig. 1. High-level architecture of GOLEM


Architecture. The flow of data inside GOLEM is depicted in Fig. 1. The system
of CHCs is read from a .smt2 file, a script in an extension of the language of
SMT-LIB.³ The Interpreter interprets the SMT-LIB script and builds the internal
representation of the system of CHCs. In GOLEM, CHCs are first normalized, then
the system is translated into an internal graph representation. Normalization 
rewrites clauses to ensure that each predicate has only variables as arguments. 
The graph representation of the system is then passed to the Preprocessor, 
which applies various transformations to simplify the input graph. Preprocessor 
then hands the transformed graph to the chosen back-end engine. Engines in 


3 https://chc-comp.github.io/format.html.
GOLEM implement various SMT-based model-checking algorithms for solving
the CHC satisfiability problem. There are currently six engines in GOLEM: TPA,
BMC, KIND, IMC, LAWI, and SPACER (see details in Sect. 3). The user selects the
engine to run with the command-line option --engine. GOLEM relies on the
interpolating SMT solver OPENSMT [42] not only for answering SMT queries but
also for interpolant computation required by most of the engines. Interpolating 
procedures in OPENSMT can be customized on demand for the specific needs of 
each engine [1]. Additionally, GOLEM re-uses the data structures of OPENSMT 
for representing and manipulating terms. 
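To give an impression of the kind of input script described above, the following Python sketch prints a small CHC system in SMT-LIB syntax with the `HORN` logic. The helper names (`declare`, `clause`, `chc_script`) and the predicate `Inv` are hypothetical, and the sketch only follows the general shape of such scripts; the exact syntactic restrictions are those of the format description referenced in the footnote.

```python
def declare(name, sorts):
    """Declare an uninterpreted predicate with the given argument sorts."""
    return f"(declare-fun {name} ({' '.join(sorts)}) Bool)"


def clause(variables, body, head):
    """Render one CHC as a universally quantified implication body => head."""
    binders = " ".join(f"({var} {sort})" for var, sort in variables)
    return f"(assert (forall ({binders}) (=> {body} {head})))"


def chc_script(declarations, clauses):
    """Assemble a complete script: logic, declarations, clauses, query command."""
    lines = ["(set-logic HORN)", *declarations, *clauses, "(check-sat)"]
    return "\n".join(lines) + "\n"


# Hypothetical counter loop: Inv holds initially, is preserved by x := x + 1,
# and must exclude negative values.
SCRIPT = chc_script(
    [declare("Inv", ["Int"])],
    [
        clause([("x", "Int")], "(= x 0)", "(Inv x)"),
        clause([("x", "Int"), ("y", "Int")],
               "(and (Inv x) (= y (+ x 1)))", "(Inv y)"),
        clause([("x", "Int")], "(and (Inv x) (< x 0))", "false"),
    ],
)

print(SCRIPT)
```

Written to a file, such a script could be passed to the solver together with the --engine option mentioned above. The spelling of the engine names accepted by that option is not given in this section and should be taken from the tool's own documentation.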


Models and Proofs. Besides solving the CHC satisfiability problem, a witness 
for the answer is often required by the domain-specific application. A satisfiability
witness is a model, an interpretation of the CHC predicates that makes all
clauses valid. An unsatisfiability witness is a proof, a derivation of the empty clause
from the input clauses. In software verification these witnesses correspond to
program invariants and counterexample paths, respectively. All engines in GOLEM
produce witnesses for their answer. Witnesses from engines are translated back 
through the applied preprocessing transformations. Only after this backtransla- 
tion, the witness matches the original input system and is reported to the user. 
Witnesses must be explicitly requested with the option --print-witness. 

Models are internally stored as formulas in the background theory, using only 
the variables of the (normalized) uninterpreted predicates. They are presented 
to the user in the format defined by SMT-LIB [5]: a sequence of SMT-LIB’s 
define-fun commands, one for each uninterpreted predicate. 

For the proofs, GOLEM follows the trace format proposed by ELDARICA. 
Internally, proofs are stored as a sequence of derivation steps. Every derivation 
step represents a ground instance of some clause from the system. The ground 
instances of predicates from the body form the premises of the step, and the 
ground instance of the head’s predicate forms the conclusion of the step. For 
the derivation to be valid, the premises of each step must have been derived 
earlier, i.e., each premise must be a conclusion of some derivation step earlier in 
the sequence. To the user, the proof is presented as a sequence of derivations of 
ground instances of the predicates, where each step is annotated with the indices 
of its premises. See Example 1 below for the illustration of the proof trace. 

GOLEM also implements an internal validator that checks the correctness 
of the witnesses. It validates a model by substituting the interpretations for the 
predicates and checking the validity of all the clauses with OPENSMT. Proofs are 
validated by checking all conditions listed above for each derivation step. Valida- 
tion is enabled with an option --validate and serves primarily as a debugging 
tool for the developers of witness production. 
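The idea behind model validation, substituting the interpretations for the predicates and checking every clause, can be illustrated with a Python sketch. It is only a bounded sanity check under stated assumptions: instead of a validity query to an SMT solver, it enumerates integer values from a small range, so passing it does not establish validity. The clause encoding, the predicate `Inv` and both candidate models are hypothetical.

```python
from itertools import product

# Hypothetical linear CHC system over integers, one tuple per clause:
#   (number of variables, constraint, body atoms, head atom or None for false)
# An atom is (predicate name, positions of the clause variables it is applied to).
CLAUSES = [
    (1, lambda x: x == 0, [], ("Inv", (0,))),                      # x = 0 => Inv(x)
    (2, lambda x, y: y == x + 1, [("Inv", (0,))], ("Inv", (1,))),  # Inv(x) /\ y = x + 1 => Inv(y)
    (1, lambda x: x < 0, [("Inv", (0,))], None),                   # Inv(x) /\ x < 0 => false
]


def apply_atom(model, atom, env):
    """Evaluate a predicate's interpretation on the values chosen for its arguments."""
    name, positions = atom
    return model[name](*(env[i] for i in positions))


def clause_holds(clause, model, values):
    """Check the clause for every assignment of its variables drawn from `values`."""
    arity, constraint, body, head = clause
    for env in product(values, repeat=arity):
        antecedent = constraint(*env) and all(apply_atom(model, a, env) for a in body)
        consequent = head is not None and apply_atom(model, head, env)
        if antecedent and not consequent:
            return False
    return True


def model_passes(clauses, model, values):
    return all(clause_holds(c, model, values) for c in clauses)


good_model = {"Inv": lambda x: x >= 0}   # candidate invariant
bad_model = {"Inv": lambda x: True}      # too weak: violates the query clause

print(model_passes(CLAUSES, good_model, range(-10, 11)))
print(model_passes(CLAUSES, bad_model, range(-10, 11)))
```

A real validator replaces the enumeration by one validity check per clause in the background theory, which is what the paragraph above describes for OPENSMT.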
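Example 1. Consider the following CHC system and the proof of its
unsatisfiability.


$$
\begin{array}{rclll}
x > 0 & \implies & L_1(x) & \qquad 1.\ L_1(1) & \\
x' = x + 1 & \implies & D(x, x') & \qquad 2.\ D(1, 2) & \\
L_1(x) \land D(x, x') & \implies & L_2(x') & \qquad 3.\ L_2(2) & ;\ 1, 2 \\
L_2(x) \land x \le 2 & \implies & \mathit{false} & \qquad 4.\ \mathit{false} & ;\ 3
\end{array}
$$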

The derivation of false consists of four derivation steps. Step 1 instantiates 
the first clause for x := 1. Step 2 instantiates the second clause for x := 1 and 
x’ := 2. Step 3 applies resolution to the instance of the third clause for x := 1 
and x’ := 2 and facts derived in steps 1 and 2. Finally, step 4 applies resolution 
to the instance of the fourth clause for x := 2 and the fact derived in step 3. 
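The conditions on a proof trace can be checked mechanically. The following Python sketch replays the four steps of Example 1 and verifies, for each step, that the instantiated constraint holds and that every premise is the conclusion of an earlier step. The data layout is an assumption made for illustration, not GOLEM's trace format, and the constraint of the fourth clause is taken to be x ≤ 2, which is what the instance x := 2 in step 4 requires.

```python
# Clauses as (constraint, body atoms, head atom or None for false).
# An atom is (predicate name, names of the clause variables used as arguments).
CLAUSES = [
    (lambda e: e["x"] > 0, [], ("L1", ("x",))),
    (lambda e: e["x'"] == e["x"] + 1, [], ("D", ("x", "x'"))),
    (lambda e: True, [("L1", ("x",)), ("D", ("x", "x'"))], ("L2", ("x'",))),
    (lambda e: e["x"] <= 2, [("L2", ("x",))], None),   # assumed reading: x <= 2
]

# Steps as (clause number, variable assignment, premise step numbers), 1-based.
PROOF = [
    (1, {"x": 1}, []),
    (2, {"x": 1, "x'": 2}, []),
    (3, {"x": 1, "x'": 2}, [1, 2]),
    (4, {"x": 2}, [3]),
]


def ground(atom, env):
    """Instantiate an atom under the assignment; None stands for false."""
    if atom is None:
        return None
    name, variables = atom
    return (name, tuple(env[v] for v in variables))


def check_proof(clauses, proof):
    """Return True iff every step is a valid ground instance and the last derives false."""
    conclusions = []
    for clause_no, env, premises in proof:
        constraint, body, head = clauses[clause_no - 1]
        if not constraint(env) or len(premises) != len(body):
            return False
        for atom, premise_no in zip(body, premises):
            # A premise must refer to a strictly earlier step ...
            if not 1 <= premise_no <= len(conclusions):
                return False
            # ... whose conclusion is exactly the instantiated body atom.
            if conclusions[premise_no - 1] != ground(atom, env):
                return False
        conclusions.append(ground(head, env))
    return bool(conclusions) and conclusions[-1] is None


print(check_proof(CLAUSES, PROOF))
```

Swapping the two premises of step 3, or moving step 3 before step 2, makes the check fail, because a premise then no longer matches an earlier conclusion.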


Preprocessing Transformations. Preprocessing can significantly improve 
performance by transforming the input CHC system into one more suitable for 
the back-end engine. The most important transformation in GOLEM is predicate 
elimination. Given a predicate not present in both the body and the head of the 
same clause, the predicate can be eliminated by exhaustive application of the 
resolution rule. This transformation is most beneficial when it also decreases the 
number of clauses. Clause merging is a transformation that merges all clauses 
with the same uninterpreted predicates in the body and the head to a single 
clause by disjoining their constraints. This effectively pushes work from the level 
of the model-checking algorithm to the level of the SMT solver. Additionally, 
GOLEM detects and deletes redundant clauses, i.e., clauses that cannot partici- 
pate in the proof of unsatisfiability. 

An important feature of GOLEM is that all applied transformations are 
reversible in the sense that any model or proof for the transformed system can 
be translated back to a model or proof of the original system. 
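Clause merging, as described above, can be sketched in a few lines of Python. The sketch assumes that the clauses are already normalized so that a given predicate is applied to the same variables in every clause; under that assumption, constraints of clauses with the same body and head predicates can simply be disjoined. The tuple layout, the textual SMT-LIB-style constraints and the predicates `P` and `Q` are illustrative choices, not GOLEM's data structures.

```python
from collections import defaultdict


def merge_clauses(clauses):
    """Merge clauses that share body and head predicates by disjoining constraints.

    Each clause is (constraint, body, head): `constraint` is a formula as text,
    `body` a tuple of predicate names, `head` a predicate name or None for false.
    """
    groups = defaultdict(list)
    for constraint, body, head in clauses:
        # The body is treated as a multiset of predicates, hence the sorting.
        groups[(tuple(sorted(body)), head)].append(constraint)

    merged = []
    for (body, head), constraints in groups.items():
        if len(constraints) == 1:
            combined = constraints[0]
        else:
            combined = "(or " + " ".join(constraints) + ")"
        merged.append((combined, body, head))
    return merged


# Hypothetical input: two clauses from P to Q and one fact for P.
CLAUSES = [
    ("(> x 0)", ("P",), "Q"),
    ("(< x (- 5))", ("P",), "Q"),
    ("(= x 0)", (), "P"),
]

for merged_clause in merge_clauses(CLAUSES):
    print(merged_clause)
```

The two clauses from `P` to `Q` collapse into one clause with a disjunctive constraint, so the case split is left to the SMT solver rather than to the model-checking algorithm.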


3 Back-end Engines of GOLEM 


The core components of GOLEM that solve the problem of satisfiability of a CHC 
system are referred to as back-end engines, or just engines. GOLEM implements 
several popular state-of-the-art algorithms from model checking and software 
verification: BMC, k-induction, IMC, LAWI and SPACER. These algorithms treat 
the problem of solving a CHC system as a reachability problem in the graph 
representation. 

The unique feature of GOLEM is the implementation of the new model- 
checking algorithm based on the concept of Transition Power Abstraction (TPA). 
It is capable of much deeper analysis than other algorithms when searching for 
counterexamples [14], and it discovers transition invariants [13], as opposed to 
the usual (state) invariants. 


3.1 Transition Power Abstraction 


The TPA engine in GOLEM implements the model-checking algorithm based 
on the concept of Transition Power Abstraction. It can work in two modes: 
The first mode implements the basic TPA algorithm, which uses a single TPA 
sequence [14]. The second mode implements the more advanced version, SPLIT- 
TPA, which relies on two TPA sequences obtained by splitting the single TPA 
sequence of the basic version [13]. In GOLEM, both variants use the under- 
approximating model-based projection for propagating truly reachable states, 
avoiding full quantifier elimination. Moreover, they benefit from incremental 
solving available in OPENSMT, which speeds up the satisfiability queries. 

The TPA algorithms, as described in the publications, operate on transition 
systems [13,14]. However, the engine in GOLEM is not limited to a single tran- 
sition system. It can analyze a connected chain of transition systems. In the 
software domain, this model represents programs with a sequence of consecutive 
loops. The extension to the chain of transition systems works by maintaining a 
separate TPA sequence for each node on the chain, where each node has its own 
transition relation. The reachable states are propagated forwards on the chain, 
while safe states—from which final error states are unreachable—are propagated 
backwards. In this scenario, transition systems on the chain are queried for reach- 
ability between various initial and error states. Since the transition relations 
remain the same, the summarized information stored in the TPA sequences can 
be re-used across multiple reachability queries. The learnt information summa- 
rizing multiple steps of the transition relation is not invalidated when the initial 
or error states change. 

GOLEM’s TPA engine discovers counterexample paths in unsafe transition 
systems, which readily translate to unsatisfiability proofs for the corresponding 
CHC systems. For safe transition systems, it discovers safe k-inductive transi- 
tion invariants. If a model for the corresponding CHC system is required, the 
engine first computes a quantified inductive invariant and then applies quantifier 
elimination to produce a quantifier-free inductive invariant, which is output as 
the corresponding model.⁴

The TPA engine’s ability to discover deep counterexamples and transition 
invariants gives GOLEM a unique edge for systems requiring deep exploration. 
We provide an example of this capability as part of the evaluation in Sect. 4. 


3.2 Engines for State-of-the-Art Model-Checking Algorithms 


Besides TPA, GOLEM implements several popular state-of-the-art model- 
checking algorithms. Among them are bounded model checking [12], k- 
induction [55] and McMillan’s interpolation-based model checking [49], which 
operate on transition systems. GOLEM faithfully follows the description of the 
algorithms in the respective publications. 
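For readers unfamiliar with the first two algorithms, the following Python sketch shows bounded model checking and a basic form of k-induction on an explicit, finite transition system. This is an illustration under strong simplifying assumptions and not GOLEM's implementation: states are enumerated explicitly instead of being handled symbolically by SMT queries, the inductive step omits the simple-path strengthening, and the modulo-8 counter is a hypothetical example.

```python
def bmc(init, succ, bad, bound):
    """Return the depth of the first bad state reachable within `bound` steps, else None."""
    frontier = set(init)
    for depth in range(bound + 1):
        if any(bad(s) for s in frontier):
            return depth
        frontier = {t for s in frontier for t in succ(s)}
    return None


def k_induction(universe, init, succ, bad, max_k):
    """Try k = 1, 2, ..., max_k; return ('safe', k), ('unsafe', depth) or ('unknown', max_k)."""
    # States that end a path of k consecutive safe states (initially k = 1).
    safe_path_ends = {s for s in universe if not bad(s)}
    for k in range(1, max_k + 1):
        # Base case: no bad state is reachable in fewer than k steps.
        depth = bmc(init, succ, bad, k - 1)
        if depth is not None:
            return ("unsafe", depth)
        # Inductive step: k consecutive safe states are always followed by a safe state.
        if all(not bad(t) for s in safe_path_ends for t in succ(s)):
            return ("safe", k)
        # Extend the paths of safe states by one more transition.
        safe_path_ends = {t for s in safe_path_ends for t in succ(s) if not bad(t)}
    return ("unknown", max_k)


# Hypothetical system: a counter modulo 8 that starts at 0 and adds 2 in each step.
def step(s):
    return [(s + 2) % 8]


print(bmc({0}, step, lambda s: s == 6, 10))
print(k_induction(range(8), {0}, step, lambda s: s == 5, 10))
print(k_induction(range(8), {0}, step, lambda s: s == 6, 10))
```

With the bad state 5, which is unreachable, the inductive step fails for small k because the unreachable path 7, 1, 3 leads to 5; it succeeds once k is large enough to rule that path out. This mirrors the observation in the evaluation that the k-inductive check is what allows the KIND engine to prove safety.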


4 The generation of unsatisfiability proofs also works for the extension to chains of
transition systems, while the generation of models for this case is still under devel- 
opment. 
Additionally, GOLEM implements Lazy Abstractions with Interpolants
(LAWI), an algorithm introduced by McMillan for verification of software [50].⁵
In the original description, the algorithm operates on programs represented with 
abstract reachability graphs, which map straightforwardly to linear CHC systems. 
This is the input supported by our implementation of the algorithm in GOLEM. 

The last engine in GOLEM implements the IC3-based algorithm SPACER [45] 
for solving general, even nonlinear, CHC systems. Nonlinear CHC systems can 
model programs with summaries, and in this setting, SPACER computes both 
under-approximating and over-approximating summaries of the procedures to 
achieve modular analysis of programs. SPACER is currently the only engine in 
GOLEM capable of solving nonlinear CHC systems. 

All engines in GOLEM rely on OPENSMT for answering SMT queries, often 
leveraging the incremental capabilities of OPENSMT to implement the corre- 
sponding model-checking algorithm efficiently. Additionally, the engines IMC, 
LAWI, SPACER and TPA heavily use the flexible and controllable interpolation 
framework in OPENSMT [1,52], especially multiple interpolation procedures for 
linear-arithmetic conflicts [8,15]. 


4 Experiments 


In this section, we evaluate the performance of GOLEM’s individual engines on
the benchmarks from the latest edition of CHC-COMP. The goal of these experi- 
ments is to 1) demonstrate the usefulness of multiple back-end engines and their 
potential combined use for solving various problems, and 2) compare GOLEM 
against state-of-the-art Horn solvers. 

The benchmark collections of CHC-COMP represent a rich source of problems
from various domains.⁶ Version 0.3.2 of GOLEM was used for these experiments.
Z3-SPACER (Z3 4.11.2) and ELDARICA 2.0.8 were run (with default
options) for comparison as the best Horn solvers available. All experiments 
were conducted on a machine with an AMD EPYC 7452 32-core processor and 
8 × 32 GiB of memory; the timeout was set to 300 s. No conflicting answers were
observed in any of the experiments. The results are in line with the results of 
the last editions of CHC-COMP where GOLEM participated [24,31]. Our artifact 
for reproducing the experiments is available at
https://doi.org/10.5281/zenodo.7973428.


4.1 Category LRA-TS 


We ran all engines of GOLEM on all 498 benchmarks from the LRA-TS (transition 
systems over linear real arithmetic) category of CHC-COMP. 

Table 1 shows the number of benchmarks solved per engine, together with a 
virtual best (VB) engine.⁷ On unsatisfiable problems, the differences between the


5 It is also known as Impact, which was the first tool that implemented the algorithm. 
6 https://github.com/orgs/chc-comp/repositories.
7 The virtual best engine picks the best performance among all engines for each benchmark.
Table 1. Number of solved benchmarks from LRA-TS category.


      | BMC | KIND | IMC | LAWI | SPACER | SPLIT-TPA | VB
SAT   |   0 |  260 | 145 |  279 |    195 |       128 | 360
UNSAT |  86 |   84 |  70 |   76 |     69 |        72 |  86


engines’ performance are not substantial, but the BMC engine firmly dominates 
the others. On satisfiable problems, we see significant differences. Figure 2 plots, 
for each engine, the number of solved satisfiable benchmarks (x-axis) within the 
given time limit (y-axis, log scale). 
Fig. 2. Performance of GOLEM’s engines on SAT problems of LRA-TS category.


The large lead of VB suggests that the solving abilities of the engines are 
widely complementary. No single engine dominates the others on satisfiable 
instances. The portfolio of techniques available in GOLEM is much stronger than 
any single one of them. 
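The virtual best engine used in this comparison is straightforward to compute from per-engine results. The Python sketch below shows one way to do it; the engine names, benchmark names and runtimes are hypothetical placeholders and are not taken from the experiments reported here.

```python
# Hypothetical runtimes in seconds; None marks a timeout or an unknown answer.
RESULTS = {
    "engine_a": {"b1": 1.5, "b2": None, "b3": 40.0},
    "engine_b": {"b1": 0.2, "b2": 12.0, "b3": None},
    "engine_c": {"b1": None, "b2": None, "b3": None},
}


def solved_per_engine(results):
    """Count, for each engine, the benchmarks it solved."""
    return {
        engine: sum(runtime is not None for runtime in runs.values())
        for engine, runs in results.items()
    }


def virtual_best(results):
    """For each benchmark solved by at least one engine, keep the fastest runtime."""
    best = {}
    for runs in results.values():
        for benchmark, runtime in runs.items():
            if runtime is None:
                continue
            if benchmark not in best or runtime < best[benchmark]:
                best[benchmark] = runtime
    return best


print(solved_per_engine(RESULTS))
print(len(virtual_best(RESULTS)))
```

When the engines solve largely different benchmarks, the virtual best count is well above every single-engine count, which is the sense in which the engines are called complementary above.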

Moreover, the unified setting enables direct comparison of the algorithms. 
For example, we can conclude from these experiments that the extra check for 
k-inductive invariants on top of the BMC-style search for counterexamples, as 
implemented in the KIND engine, incurs only a small overhead on unsatisfi- 
able problems, but makes the KIND engine very successful in solving satisfiable 
problems. 


4.2 Category LIA-Lin 


Next, we considered the LIA-Lin category of CHC-COMP. These are linear sys- 
tems of CHCs with linear integer arithmetic as the background theory. There 
are many benchmarks in this category, and for the evaluation at the competition,
a subset of benchmarks is selected (see [24,31]). We evaluated the LAWI and 
SPACER engines of GOLEM (the engines capable of solving general linear CHC 
systems) on the benchmarks selected at CHC-COMP 2022 and compared their 
performance to Z3-SPACER and ELDARICA. Notably, we also examined a specific
subcategory of LIA-Lin, namely extra-small-lia,⁸ with benchmarks that
fall into the fragment accepted by GOLEM’s TPA engine. 

There are 55 benchmarks in extra-small-lia subcategory, all satisfiable, 
but known to be highly challenging for all tools. The results, given in Table 2, 
show that SPLIT-TPA outperforms not only the LAWI and SPACER engines in
GOLEM, but also Z3-SPACER. Only ELDARICA solves more benchmarks. We
ascribe this to SPLIT-TPA’s capability to perform deep analysis and discover 
transition invariants. 


Table 2. Number of solved benchmarks from extra-small-lia subcategory. 


          GOLEM           |           |
SPLIT-TPA | LAWI | SPACER | Z3-SPACER | ELDARICA
       22 |   12 |     18 |        18 |       36


For the whole LIA-Lin category, 499 benchmarks were selected in the 2022 
edition of CHC-COMP [24]. The performance of the LAWI and SPACER engines 
of GOLEM, Z3-SPACER and ELDARICA on this selection is summarized in Table 3. 
Here, the SPACER engine of GOLEM significantly outperforms the LAWI engine. 
Moreover, even though GOLEM loses to Z3-SPACER, it beats ELDARICA. Given 
that GOLEM is a prototype, and Z3-SPACER and ELDARICA have been developed 
and optimized for several years, this demonstrates the great potential of GOLEM. 


Table 3. Number of solved benchmarks from LIA-Lin category. 


      |     GOLEM     |           |
      | LAWI | SPACER | Z3-SPACER | ELDARICA
SAT   |  131 |    184 |       211 |      183
UNSAT |   77 |     82 |        96 |       60


4.3 Category LIA-Nonlin 


Finally, we considered the LIA-Nonlin category of benchmarks of CHC-COMP, 
which consists of nonlinear systems of CHCs with linear integer arithmetic as the 
background theory. For the experiments, we used the 456 benchmarks selected for 
the 2022 edition of CHC-COMP. SPACER is the only engine in GOLEM capable 
of solving nonlinear CHC systems; thus, we focused on a more detailed compar- 
ison of its performance against Z3-SPACER and ELDARICA. The results of the 
experiments are summarized in Fig. 3 and Table 4.


Table 4. Number of solved benchmarks from LIA-Nonlin category. The number of
uniquely solved benchmarks is in parentheses. 


GOLEM-SPACER | Z3-SPACER | ELDARICA 
SAT 239 (4) 248 (13) 221 (6) 
UNSAT | 124 (2) 139 (5) 122 (0) 


Overall, GOLEM solved fewer problems than Z3-SPACER but more than 
ELDARICA; however, all tools solved some instances uniquely. A detailed compar- 
ison is depicted in Fig. 3. For each benchmark, its data point in the plot reflects 
the runtime of GOLEM (x-axis) and the runtime of the competitor (y-axis). The 
plots suggest that the performance of GOLEM is often orthogonal to ELDARICA, 
but highly correlated with the performance of Z3-SPACER. This is not surpris- 
ing as the SPACER engine in GOLEM is built on the same core algorithm. Even 
though GOLEM is often slower than Z3-SPACER, there is a non-trivial number
of benchmarks on which Z3-SPACER times out, but which GOLEM solves fairly 
quickly. Thus, GOLEM, while being a newcomer, already complements existing 
state-of-the-art tools, and more improvements are expected in the near future. 

To summarise, the overall experimentation with different engines of GOLEM 
demonstrates the advantages of the multi-engine general framework and illus- 
trates the competitiveness of its analysis. It provides a lot of flexibility in address- 
ing various verification problems while being easily customizable with respect to 
the analysis demands. 


8 https://github.com/chc-comp/extra-small-lia.


5 Conclusion


In this work, we presented GOLEM, a flexible and effective Horn solver with mul- 
tiple back-end engines, including recently-introduced TPA-based model-checking 
algorithms. GOLEM is a suitable research tool for prototyping new SMT-based 
model-checking algorithms and comparing algorithms in a unified framework. 
Additionally, the effective implementation of the algorithm achieved with tight 
coupling with the underlying SMT solver makes it an efficient back end for 
domain-specific verification tools. Future directions for GOLEM include support 
for VMT input format [21] and analysis of liveness properties, extension of TPA 
to nonlinear CHC systems, and support for SMT theories of arrays, bit-vectors 
and algebraic datatypes. 
Abstract. We present the verified model checker COQCRYPTOLINE
for cryptographic programs with certified verification results. The
COQCRYPTOLINE verification algorithm consists of two reductions. The
algebraic reduction transforms polynomial equality checking into a root
entailment problem, and the bit-vector reduction transforms range properties
into an SMT QF_BV problem. We
specify and verify both reductions formally using COQ with MATHCOMP.
The COQCRYPTOLINE tool is built on the OCAML programs extracted
from verified reductions. COQCRYPTOLINE moreover employs certified
techniques for solving the algebraic and logic problems. We evaluate
COQCRYPTOLINE on cryptographic programs from industrial security
libraries.


1 Introduction 


COQCRYPTOLINE [1] is a verified model checker with certified verification
results. It is designed for verifying complex non-linear integer computations
commonly found in cryptographic programs. The verification algorithms of
COQCRYPTOLINE consist of two reductions. The algebraic reduction transforms
polynomial equality checking into a root entailment problem in commutative
algebra; the bit-vector reduction reduces range properties to satisfiability of
queries in the Quantifier-Free Bit-Vector (QF_BV) logic from Satisfiability
Modulo Theories (SMT) [6]. Both verification algorithms are formally specified
and verified by the proof assistant COQ with MATHCOMP [7,17]. COQCRYPTOLINE
verification programs are extracted from the formal specification and
therefore verified by the proof assistant automatically.

To minimize errors from external tools, recent developments in certified verifi- 
cation are employed by COQCRYPTOLINE. The root entailment problem is solved 
by the computer algebra system (CAS) SINGULAR [19]. COQCRYPTOLINE asks 
the external algebraic tool to provide certificates and validates certificates with 
the formal polynomial theory in COQ. SMT QF_BV queries on the other hand
are answered by the verified SMT QF_BV solver COQQFBV [33]. Answers to
SMT QF_BV queries are therefore all certified as well. With formally verified
algorithms and certified answers from external tools, COQCRYPTOLINE gives 
verification results with much better guarantees than average automatic verifi- 
cation tools. 

Reliable verification tools would not be very useful if they could not check 
real-world programs effectively. In our experiments, COQCRYPTOLINE verifies 
54 real-world cryptographic programs. 52 of them are from well-known security 
libraries such as BITCOIN [35] and OPENSSL [30]. They are implementations 
of field and group operations in elliptic curve cryptography. The remaining two 
are the Number-Theoretic Transform (NTT) programs from the post-quantum 
cryptosystem KYBER [10]. All field operations are implemented in a few hundred 
lines and verified in 6 minutes. The most complicated generic group operation 
in the elliptic curve Curve25519 consists of about 4000 lines and is verified by 
COQCRYPTOLINE in 1.5 h.


Related Work. There are numerous model checkers in the community, e.g. [8,
13, 21-23]. Nevertheless, few of them are formally verified. To our knowledge,
the first verification of a model checker was performed in COQ for
the modal μ-calculus [34]. The LTL model checker CAVA [15,27] and the
model checker Munta [38,39] for timed automata were developed and verified
using ISABELLE/HOL [29]; they can be considered verified counterparts
of SPIN [21] and UPPAAL [23], respectively. COQCRYPTOLINE instead checks
CRYPTOLINE models [16,31], which specify the correctness of cryptographic
programs. It can be seen as a verified version of CRYPTOLINE. A large body of work
studies the correctness of cryptographic programs, e.g. [2-4,9,12,14,24,26,40],
cf. [5] for a survey. They either require human intervention or are unverified, while
our work is fully automatic and verified. The most relevant work is
BVCRYPTOLINE [37], which is the first automated and partly verified model checker
for a very limited subset of CRYPTOLINE. We will compare our work with it
comprehensively in Sect. 2.3.


2 COQCRYPTOLINE


COQCRYPTOLINE is an automatic verification tool that takes a CRYPTOLINE
specification as input and returns certified results indicating the validity of the 
specification. We briefly describe the CRYPTOLINE language [16] followed by the 
modules, features, and optimizations of COQCRYPTOLINE in this section. 


2.1 CRYPTOLINE Language 


A CRYPTOLINE specification contains a CRYPTOLINE program with pre- 
and post-conditions, where the CRYPTOLINE program usually models some 
cryptographic program [16,31]. Both the pre- and post-conditions consist of an 
algebraic part, which is formulated as a conjunction of (modular) equations, and
a range part as an SMT QF_BV predicate. A CRYPTOLINE specification is
valid if every program execution starting from a program state satisfying the 
pre-condition ends in a state satisfying the post-condition. 

CRYPTOLINE is designed for modeling cryptographic assembly programs.
Besides the assignment (MOV) and conditional assignment (CMOV) statements,
CRYPTOLINE provides arithmetic statements such as addition (ADD), addition
with carry (ADC), subtraction (SUB), subtraction with borrow (SBB), half
multiplication (MUL) and full multiplication (MULL). Most of them have versions that
model the carry/borrow flags explicitly (like ADDS, ADCS, SUBS, SBBS). It also
allows bitwise statements, for instance, bitwise AND (AND), OR (OR) and
left-shift (SHL). To deal with multi-word arithmetic, CRYPTOLINE further includes
multi-word constructs, for example, those that split (SPLIT) or join (JOIN) words, 
as well as multi-word shifts (CSHL). CRYPTOLINE is strongly typed, admitting 
both signed and unsigned interpretations for bit-vector variables and constants. 
The CAST statement converts types explicitly. Finally, CRYPTOLINE also sup- 
ports special statements (ASSERT and ASSUME) for verification purposes. 
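
To make the flag-exposing statements concrete, the following is a minimal Python sketch of how a few of them behave on unsigned 64-bit words. It is an illustration only: the function names mirror the mnemonics for readability, the argument order is an assumption, and typing and signed interpretations are ignored.

```python
# Hypothetical Python model of a few CRYPTOLINE-style statements on unsigned
# 64-bit words. Names mirror the mnemonics but are not the tool's API.
W = 64
MASK = (1 << W) - 1

def adds(a, b):
    """ADDS: add two words, returning (carry, sum)."""
    s = a + b
    return s >> W, s & MASK

def adcs(a, b, carry_in):
    """ADCS: add with carry-in, returning (carry_out, sum)."""
    s = a + b + carry_in
    return s >> W, s & MASK

def mull(a, b):
    """MULL: full multiplication, returning (high word, low word)."""
    p = a * b
    return p >> W, p & MASK

def split(a, n):
    """SPLIT: cut a word at bit n, returning (high part, low n bits)."""
    return a >> n, a & ((1 << n) - 1)

def add_chain(xs, ys):
    """Multi-word addition of little-endian limb lists via an ADCS chain."""
    carry, out = 0, []
    for x, y in zip(xs, ys):
        carry, s = adcs(x, y, carry)
        out.append(s)
    return carry, out
```

The final carry returned by `add_chain` is exactly the quantity that must be zero for a multi-word sum to fit in its limbs.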


2.2 The Architecture of COQCRYPTOLINE 


COQCRYPTOLINE reduces the verification problem of a CRYPTOLINE specification
to instances of root entailment problems and SMT problems over the
QF_BV logic. These instances are then solved by respective certified techniques.
Moreover, the components in COQCRYPTOLINE are also specified and verified
by the proof assistant COQ with MATHCOMP [7,17]. Figure 1 gives an overview
of COQCRYPTOLINE. In the figure, dashed components represent external tools.
Rectangular boxes are verified components and rounded boxes are unverified.
Note that all our proof efforts using COQ are transparent to users. No COQ
proof is required from users during verification of cryptographic programs with
COQCRYPTOLINE. Details can be found in [36].

Starting from a CRYPTOLINE specification text, the COQCRYPTOLINE parser 
translates the text into an abstract syntax tree defined in the COQ module DSL. 
The module gives formal semantics for the typed CRYPTOLINE language [16]. 
The validity of CRYPTOLINE specifications is also formalized. Similar to most 
program verification tools, COQCRYPTOLINE transforms CRYPTOLINE specifi- 
cations to the static single assignment (SSA) form. The SSA module gives our 
transformation algorithm. It moreover shows that validity of CRYPTOLINE speci- 
fications is preserved by the SSA transformation. COQCRYPTOLINE then reduces 
the verification problem via two COQ modules. 
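
As an illustration of the SSA step, the sketch below renames variables in a straight-line program so that each name is assigned exactly once. The tuple encoding `(dest, op, operands)` and the `name_version` suffix scheme are assumptions made for this example; the verified SSA module handles the full typed language, including statements with several destinations.

```python
def to_ssa(program):
    """Rename variables so that every variable is assigned exactly once.

    program: list of (dest, op, operands); operands are variable names or
    integer constants. Inputs that are never assigned keep version 0.
    """
    version = {}

    def use(v):
        # Constants are left untouched; variables get their current version.
        if isinstance(v, int):
            return v
        return f"{v}_{version.get(v, 0)}"

    out = []
    for dest, op, args in program:
        new_args = [use(a) for a in args]          # read old versions first
        version[dest] = version.get(dest, 0) + 1   # then bump the destination
        out.append((f"{dest}_{version[dest]}", op, new_args))
    return out


example = [("r", "add", ["a", "b"]),
           ("r", "add", ["r", "c"]),
           ("a", "mul", ["r", 2])]
ssa = to_ssa(example)
```

Because every name has a single definition after renaming, each statement can later be read as one equation or one bit-vector constraint over fixed variables.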

The SSA2ZSSA module contains our algebraic reduction to the root entail- 
ment problem. Concretely, a system of (modular) equations is constructed from 
the given program so that program executions correspond to the roots of the 
system of (modular) equations. To verify algebraic post-conditions, it suffices to 
check if the roots for executions are also roots of (modular) equations in the 
post-condition. However, program executions can deviate from roots of (modular)
equations when over- or under-flow occurs. COQCRYPTOLINE will generate
soundness conditions to ensure the executions conform to our (modular) equations.
The algebraic verification problem is thus reduced to the root entailment
problem provided that soundness conditions hold.


Fig. 1. Overview of COQCRYPTOLINE

The SSA2QFBV module gives our bit-vector reduction to the SMT QF_BV
problem. It constructs an SMT query to check the validity of the given CRYPTOLINE
range specification. Concretely, an SMT QF_BV query is built such that
all program executions correspond to satisfying assignments to the query and
vice versa. To verify the range post-conditions, it suffices to check if satisfying
assignments for the query also satisfy the post-conditions. The range verification
problem is thus reduced to the SMT QF_BV problem. On the other hand,
additional SMT queries are constructed to check soundness conditions for the 
algebraic reduction. We formally prove the equivalence between soundness con- 
ditions and corresponding queries. 
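
The role of a soundness condition can be seen on a single ADD statement. The sketch below is a toy model, not the tool's reduction: it assumes a 4-bit word size so that all inputs can be enumerated, and it takes "no unsigned overflow" as the soundness condition. It checks that the integer equation r = a + b describes the bit-accurate result exactly when that condition holds.

```python
from itertools import product

W = 4                 # tiny word size (assumption) so every input can be enumerated
LIMIT = 1 << W

def add_bv(a, b):
    """Bit-accurate ADD: the result wraps around modulo 2**W."""
    return (a + b) % LIMIT

def equation_holds(a, b):
    """Algebraic view of ADD: the result equals a + b over the integers."""
    return add_bv(a, b) == a + b

def soundness_condition(a, b):
    """Assumed soundness condition for ADD: no unsigned overflow."""
    return a + b < LIMIT

inputs = list(product(range(LIMIT), repeat=2))

# Executions where the bit-vector result deviates from the integer equation.
deviating = [(a, b) for a, b in inputs if not equation_holds(a, b)]

# Whenever the soundness condition holds, the equation is a faithful model.
reduction_is_sound = all(equation_holds(a, b)
                         for a, b in inputs if soundness_condition(a, b))
```

In the tool the condition is discharged by an SMT query rather than by enumeration; the enumeration here only stands in for that query at a toy bit-width.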

With the two formally verified reduction algorithms, it remains to solve the
root entailment problems and the SMT QF_BV problems with external solvers.
COQCRYPTOLINE invokes an external computer algebra system (CAS) to solve
the root entailment problems, and improves the techniques in [20,37] to validate
the (untrusted) returned answers. Currently, the CAS SINGULAR [19] is supported.
To solve the SMT QF_BV problems, COQCRYPTOLINE employs the
certified SMT QF_BV solver COQQFBV [33]. In all cases, instances of the two
kinds of problems are solved with certificates, and COQCRYPTOLINE employs
verified certificate checkers to validate the answers to further improve assurance.

Note that the algebraic reduction in SSA2ZSSA is sound but not complete due 
to the abstraction of bit-accurate semantics into (modular) polynomial equations 
over integers. Thus a failure in solving the root entailment problem by CAS does 
not mean that the algebraic post-conditions are violated. On the other hand, the 
bit-vector reduction in SSA2QFBV is both sound and complete. 

The COQCRYPTOLINE tool is built on OCAML programs extracted from 
verified algorithms in COQ with MATHCOMP. We moreover integrate the OCAML 
programs from the certified SMT QF_BV solver COQQFBV. Our trusted
computing base consists of (1) the COQCRYPTOLINE parser, (2) the text interface with
external SAT solvers (from COQQFBV), (3) the proof assistant ISABELLE [29]
(from the SAT solver certificate validator GRAT used by COQQFBV) and (4) the
COQ proof assistant. In particular, sophisticated decision procedures in external
CASs and SAT solvers used in COQQFBV need not be trusted.


2.3 Features and Optimizations 


CoQCRYPTOLINE comes with the following features and optimizations imple- 
mented in its modules. 


Type System. COQCRYPTOLINE fully supports the type system of the CRYP- 
TOLINE language. The type system is used to model bit-vectors of arbitrary 
bit-widths with unsigned or signed interpretation. Such a type system allows 
COQCRYPTOLINE to model more industrial examples translated from C programs
via GCC [16] or LLVM [24] compared to BVCRYPTOLINE [37], which only
allows unsigned bit-vectors, all of the same bit-width. 


Mixed Theories. With the ASSERT and ASSUME statements supported by
COQCRYPTOLINE, it is possible to make an assertion on the range side (or
on the algebraic side) and then make an equivalent assumption on the algebraic
side (or resp. on the range side). With this feature, a predicate can be
asserted on one side where the predicate is easier to prove, and then assumed
on the other side to ease the verification of other predicates. The equivalence
between the asserted predicate and the assumed predicate is currently not verified
by COQCRYPTOLINE, though it is achievable. Neither ASSERT nor ASSUME
statements are available in BVCRYPTOLINE.


Multi-threading. All extracted OCAML code from the verified algorithms in COQ
runs sequentially. To speed up verification, SMT QF_BV problems, as well as root
entailment problems, are solved in parallel.


Efficient Root Entailment Problem Solving. COQCRYPTOLINE can be used as
a solver for root entailment problems with certificates validated by a verified
validator. A root entailment problem is reduced to an ideal membership problem,
which is then solved by computing a Gröbner basis [20]. To solve a root entailment
problem with a certificate, we need to find a witness of polynomials $c_0, \ldots, c_n$
such that

    $q = \sum_{i=0}^{n} c_i p_i$    (1)


where $q$ and the $p_i$'s are given polynomials. To compute the witness, BVCRYPTOLINE
relies on gbarith [32], where new variables are introduced. COQCRYPTOLINE
utilizes the lift command in SINGULAR instead without adding fresh variables.
We show in the evaluation section that using lift is more efficient than using
gbarith. The witness found is further validated by COQCRYPTOLINE, which
relies on the polynomial normalization procedure norm_subst in COQ to check
if Eq. 1 holds. BVCRYPTOLINE on the other hand uses the ring tactic in COQ,
where extra type checking is performed. Elimination of ideal generators through
variable substitution is an efficient approach to simplify an ideal membership
problem [37]. The elimination procedure implemented in COQCRYPTOLINE can
identify many more variable substitution patterns than those found by
BVCRYPTOLINE.
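
Validating a witness is much simpler than finding one: it only requires expanding the right-hand side of Eq. 1 and comparing it with q. The following Python sketch illustrates this check with a naive sparse-polynomial representation (a dictionary from exponent tuples to integer coefficients). The representation and the helper names are assumptions for illustration; the tool itself relies on a verified normalization procedure in COQ.

```python
from collections import defaultdict

def p_add(f, g):
    """Add two polynomials given as {exponent tuple: coefficient}."""
    h = defaultdict(int)
    for m, c in list(f.items()) + list(g.items()):
        h[m] += c
    return {m: c for m, c in h.items() if c != 0}

def p_mul(f, g):
    """Multiply two polynomials by expanding all pairs of monomials."""
    h = defaultdict(int)
    for m1, c1 in f.items():
        for m2, c2 in g.items():
            m = tuple(e1 + e2 for e1, e2 in zip(m1, m2))
            h[m] += c1 * c2
    return {m: c for m, c in h.items() if c != 0}

def check_certificate(q, ps, cs):
    """Return True iff q equals the sum of c_i * p_i (Eq. 1)."""
    if len(ps) != len(cs):
        return False
    total = {}
    for c, p in zip(cs, ps):
        total = p_add(total, p_mul(c, p))
    return total == q

# Variables (x, y); monomials are exponent pairs (deg_x, deg_y).
p0 = {(1, 0): 1, (0, 1): -1}     # x - y
p1 = {(0, 1): 1, (0, 0): -1}     # y - 1
q = {(1, 1): 1, (0, 0): -1}      # x*y - 1
c0 = {(0, 1): 1}                 # y
c1 = {(0, 1): 1, (0, 0): 1}      # y + 1
valid = check_certificate(q, [p0, p1], [c0, c1])
```

Here x*y - 1 = y(x - y) + (y + 1)(y - 1), so every common root of p0 and p1 is also a root of q, and the pair (c0, c1) certifies that fact.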


Multi-moduli. Modular equations with multi-moduli are common in post-quantum
cryptography. For example, the post-quantum cryptosystem KYBER
uses the polynomial ring $\mathbb{Z}_{3329}[X]/(X^{256} + 1)$ containing two moduli 3329 and
$X^{256} + 1$. To support multi-moduli in COQCRYPTOLINE, in the proof of our algebraic
reduction, we have to find integers $c_0, \ldots, c_n$ such that $e_1 - e_2 = \sum_{i=0}^{n} c_i m_i$,
given the proof of $e_1 \equiv e_2 \pmod{m_0, \ldots, m_n}$ where $e_1$, $e_2$, and the $m_i$'s are integers.
Instead of implementing a complicated procedure to find the exact $c_i$'s, we
simply invoke the xchoose function provided by MATHCOMP to find the $c_i$'s based
on the proof of $e_1 \equiv e_2 \pmod{m_0, \ldots, m_n}$. Multi-moduli is not supported by
BVCRYPTOLINE.
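
To show what such a witness looks like for integers, the sketch below computes coefficients explicitly with the extended Euclidean algorithm. This is only an illustration of the kind of explicit procedure that the proof avoids; it says nothing about how xchoose works internally, and the function names are hypothetical.

```python
def ext_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g, where g is a gcd of a and b."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        k = old_r // r
        old_r, r = r, old_r - k * r
        old_x, x = x, old_x - k * x
        old_y, y = y, old_y - k * y
    return old_r, old_x, old_y

def witness(e1, e2, moduli):
    """Find integers c_i with e1 - e2 == sum(c_i * m_i), or None if none exist."""
    g, coeffs = 0, []
    for m in moduli:
        # Invariant: sum(c * m_i) over the moduli seen so far equals g.
        g, x, y = ext_gcd(g, m)
        coeffs = [c * x for c in coeffs] + [y]
    diff = e1 - e2
    if g == 0:
        return coeffs if diff == 0 else None
    if diff % g != 0:
        return None
    return [c * (diff // g) for c in coeffs]

moduli = [6, 10, 15]
cs = witness(7, 0, moduli)
```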


Tight Integration with COQQFBV. COQCRYPTOLINE verifies every atomic 
range predicate separately using the certified SMT QF_BV solver COQQFBV.
Constructing a text file as the input to COQQFBV for every atomic range 
predicate is not a good idea because the bit-blasting procedure in COQQFBV 
is performed several times for the identical program. COQCRYPTOLINE thus is 
tightly integrated with COQQFBV to speed up bit-blasting of the same program 
using the cache provided by COQQFBV. BVCRYPTOLINE uses the SMT solver 
BOOLECTOR to prove range predicates without certificates. 


Slicing. During the reductions from the verification problem of a CRYPTOLINE
specification to instances of root entailment problems and SMT QF_BV
problems, a verified static slicing is performed in COQCRYPTOLINE to produce
smaller problems. Unlike the work in [11], which sets all ASSUME statements as
additional slicing criteria, the slicing in COQCRYPTOLINE is capable of pruning
unrelated predicates in ASSUME statements. The slicing procedure implemented
in COQCRYPTOLINE is much more complicated than the one in BVCRYPTOLINE
due to the presence of ASSUME statements. This feature is provided as a
command-line option because it makes the verification incomplete. With slicing, the time
needed to verify industrial examples is reduced dramatically.
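
The basic idea of static slicing on an SSA program can be sketched as a single backward pass that keeps only the statements the slicing criterion depends on. The encoding below (pairs of a destination and its operand names) is an assumption for illustration, and the sketch deliberately ignores ASSUME statements, which are what make the verified slicing in the tool considerably more involved.

```python
def slice_program(program, criterion):
    """Backward slice of a straight-line SSA program.

    program: list of (dest, operand_names) in execution order.
    criterion: variables mentioned by the predicate to be verified.
    Returns the sub-list of statements the criterion depends on.
    """
    needed = set(criterion)
    kept = []
    for dest, args in reversed(program):
        if dest in needed:
            kept.append((dest, args))
            needed.update(args)   # operands of a kept statement are needed too
    kept.reverse()
    return kept

prog = [("t1", ["a", "b"]),
        ("t2", ["c", "d"]),     # irrelevant to r, so it is pruned
        ("r", ["t1", "a"])]
sliced = slice_program(prog, {"r"})
```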


3 Walkthrough 


We illustrate how COQCRYPTOLINE is used in this section. The x86_64 assembly 
subroutine ecp_nistz256_mul_montx from OPENSSL [30] shown in Fig. 2 is 
verified as an example. 

An input for COQCRYPTOLINE contains a CRYPTOLINE specification for the 
assembly subroutine. The original subroutine is marked between the comments 
PROGNAME STARTS and PROGNAME ENDS, which is obtained automatically from
the Python script provided by CRYPTOLINE [31]. 

Prior to the “START” comment are the parameter declaration, pre-condition, 
and variable initialization. After the “END” comment is the post-condition of 
the subroutine. After the subroutine ends, the result is moved to the output 
variables. 

The assembly subroutine ecp_nistz256_mul_montx takes two 256-bit
unsigned integers a and b and the modulus m as inputs. The 256-bit integer
m is the prime $p256 = 2^{256} - 2^{224} + 2^{192} + 2^{96} - 1$ from the NIST curve. The
256-bit integers a and b (less than the prime) are the multiplicands. Each 256-bit
input integer $d \in \{a, b, m\}$ is denoted by four 64-bit unsigned integer variables
$d_i$ (for $0 \le i < 4$) in little-endian representation. The expression limbs n [d0,
d1, ..., di] is short for d0 + d1*2**n + ... + di*2**(i*n).¹ The inputs and
constants are then put in the variables for memory cells with the MOV statements.
There are two parts to a pre-condition. The first part is for the algebraic
reduction; the second part is for the bit-vector reduction:


and [ m0=0xffffffffffffffff, m1=0x00000000ffffffff,
      m2=0x0000000000000000, m3=0xffffffff00000001 ]

&&

and [ m0=0xffffffffffffffff@64, m1=0x00000000ffffffff@64,
      m2=0x0000000000000000@64, m3=0xffffffff00000001@64,
      limbs 64 [a0,a1,a2,a3] <u limbs 64 [m0,m1,m2,m3],
      limbs 64 [b0,b1,b2,b3] <u limbs 64 [m0,m1,m2,m3] ]


The output 256-bit integer represented by the four variables $c_i$ (for $0 \le i < 4$)
has two requirements. Firstly, the output integer times $2^{256}$ equals the product
of the input integers modulo p256. Secondly, the output integer is less than p256.
Formally, we have this post-condition:


eqmod limbs 64 [0, 0, 0, 0, c0, c1, c2, c3]
      limbs 64 [a0, a1, a2, a3] * limbs 64 [b0, b1, b2, b3]
      limbs 64 [m0, m1, m2, m3]

&&

limbs 64 [c0, c1, c2, c3] <u limbs 64 [m0, m1, m2, m3]


Here, we employ the algebraic reduction to verify the non-linear modular 
equality, and the bit-vector reduction to verify the proper range of the output 
integer. 
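
The meaning of the limbs expressions and of the two-part post-condition can be checked on concrete numbers. The Python sketch below is illustrative only: `mont_mul_reference` computes the expected Montgomery product directly with big integers (it is not the assembly algorithm), and all helper names are hypothetical.

```python
P256 = 2**256 - 2**224 + 2**192 + 2**96 - 1

def limbs(n, ds):
    """limbs n [d0, d1, ...] = d0 + d1*2**n + d2*2**(2*n) + ..."""
    return sum(d << (i * n) for i, d in enumerate(ds))

def to_limbs(x, n=64, count=4):
    """Little-endian split of x into `count` limbs of n bits each."""
    mask = (1 << n) - 1
    return [(x >> (i * n)) & mask for i in range(count)]

def mont_mul_reference(a, b, m=P256):
    """Reference value a * b * 2**(-256) mod m (requires Python 3.8+)."""
    r_inv = pow(2**256, -1, m)
    return (a * b * r_inv) % m

def post_condition(a_l, b_l, c_l, m_l):
    """Both parts of the post-condition, evaluated on concrete limb lists."""
    a, b, c, m = (limbs(64, x) for x in (a_l, b_l, c_l, m_l))
    # eqmod: c shifted up by four limbs (c * 2**256) is congruent to a * b mod m.
    eqmod = (limbs(64, [0, 0, 0, 0] + c_l) - a * b) % m == 0
    # range part: the result is fully reduced.
    in_range = c < m
    return eqmod and in_range

m_l = to_limbs(P256)
a, b = 3, P256 - 2
c = mont_mul_reference(a, b)
ok = post_condition(to_limbs(a), to_limbs(b), to_limbs(c), m_l)
```

The limbs of P256 computed this way coincide with the constants m0 to m3 in the pre-condition.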

However, verifying ecp_nistz256_mul_montx takes extra annotations that hint to
COQCRYPTOLINE how to verify the post-condition. For example, in adding two 256-bit
integers represented by 64-bit variables, a chain of four 64-bit additions is
performed and carries are propagated. The last carry as the chain ends must be
zero or the 256-bit sum is incorrect. In ecp_nistz256_mul_montx two interleaved
addition chains use the carry and the overflow flags for carries respectively, so
we annotate as follows at the end of two interleaving addition chains to tell
COQCRYPTOLINE about the final carries:


1 ** is the exponentiation operator in CRYPTOLINE. 
proc main
  (uint64 a0, uint64 a1, uint64 a2, uint64 a3,
   uint64 b0, uint64 b1, uint64 b2, uint64 b3,
   uint64 m0, uint64 m1, uint64 m2, uint64 m3) =
{ and [ m0 = 0xffffffffffffffff,
        m1 = 0x00000000ffffffff,
        m2 = 0x0000000000000000,
        m3 = 0xffffffff00000001 ]
  &&
  and [ m0 = 0xffffffffffffffff@64,
        m1 = 0x00000000ffffffff@64,
        m2 = 0x0000000000000000@64,
        m3 = 0xffffffff00000001@64,
        limbs 64 [a0, a1, a2, a3] <u
        limbs 64 [m0, m1, m2, m3],
        limbs 64 [b0, b1, b2, b3] <u
        limbs 64 [m0, m1, m2, m3] ] }
mov L0x7fffffffd9b0 a0;  mov L0x7fffffffd9b8 a1;
mov L0x7fffffffd9c0 a2;  mov L0x7fffffffd9c8 a3;
mov L0x7fffffffd9d0 b0;  mov L0x7fffffffd9d8 b1;
mov L0x7fffffffd9e0 b2;  mov L0x7fffffffd9e8 b3;
mov L0x55555557c000 0xffffffffffffffff@uint64;
mov L0x55555557c008 0x00000000ffffffff@uint64;
mov L0x55555557c010 0x0000000000000000@uint64;
mov L0x55555557c018 0xffffffff00000001@uint64;
(* ecp_nistz256_mul_montx STARTS *)
mov rdx L0x7fffffffd9d0;
mov r9 L0x7fffffffd9b0;
mov r10 L0x7fffffffd9b8;
mov r11 L0x7fffffffd9c0;
mov r12 L0x7fffffffd9c8;
mull r9 r8 rdx r9;
mull r10 rcx rdx r10;
mov r14 0x20@uint64;
mov r13 0@uint64;
mov r8 0@uint64;
clear carry;
clear overflow;
mull rbp rcx rdx L0x7fffffffd9b0;
adcs carry r9 r9 rcx carry;
adcs overflow r10 r10 rbp overflow;
mull rbp rcx rdx L0x7fffffffd9b8;
adcs carry r10 r10 rcx carry;
adcs overflow r11 r11 rbp overflow;
mull rbp rcx rdx L0x7fffffffd9c0;
adcs carry r11 r11 rcx carry;
adcs overflow r12 r12 rbp overflow;
mull rbp rcx rdx L0x7fffffffd9c8;
mov rdx r9;
adcs carry r12 r12 rcx carry;
split ddc rcx r9 32;
shl rcx rcx 32;
adcs overflow r13 r13 rbp overflow;
split rbp dc r9 32;
assert true && rbp=ddc;
assume rbp=ddc && true;
adcs carry r13 r13 r8 carry;
adcs overflow r8 r8 r8 overflow;
assert true && and [carry=0@1, overflow=0@1];
assume and [carry=0, overflow=0] && true;
mov L0x7fffffffda00 r8;
mov L0x7fffffffda08 r9;
(* ecp_nistz256_mul_montx ENDS *)
mov c0 L0x7fffffffd9f0;
mov c1 L0x7fffffffd9f8;
mov c2 L0x7fffffffda00;
mov c3 L0x7fffffffda08;
{ eqmod limbs 64 [0, 0, 0, 0, c0, c1, c2, c3]
        limbs 64 [a0, a1, a2, a3] *
        limbs 64 [b0, b1, b2, b3]
        limbs 64 [m0, m1, m2, m3]
  &&
  limbs 64 [c0, c1, c2, c3] <u
  limbs 64 [m0, m1, m2, m3] }


Fig. 2. CRYPTOLINE Model for ecp_nistz256_mul_montx


assert true && and [ carry=0@1, overflow=0@1 ];
assume and [ carry=0, overflow=0 ] && true;


The ASSERT statement verifies that both the carry and overflow flags are 
zeroes through the bit-vector reduction. The ASSUME statement then passes this 
information to the algebraic reduction. Effectively, COQCRYPTOLINE checks that 
both flags are zero for all inputs satisfying the pre-condition, then uses those facts 
as lemmas to verify the post-condition with the algebraic reduction. 

The full specification for ecp_nistz256_mul_montx has 230 lines, including 
50 lines of manual annotations. 20 are straightforward annotations for variable 
declaration and initialization. The remaining 30 lines of annotations are hints to 
COQCRYPTOLINE, which then verifies the post-condition in 30s with 24 threads.

The illustration of the typical verification flow shows how a user constructs 
a CRYPTOLINE specification. The pre-condition for program inputs, the post- 
condition for outputs, and variable initialization must be specified manually. 
Additional annotations may be added as hints. Notice that hints only tell
COQCRYPTOLINE what properties should hold, not why. Proofs of annotated
hints and the post-condition are found by COQCRYPTOLINE automatically. Con- 
sequently, manual annotations are minimized and verification efforts are reduced 
significantly. 
4 Evaluation


We evaluate COQCRYPTOLINE on 52 benchmarks from four industrial security 
libraries BITCOIN [35], BORINGSSL [14,18], NSS [25], and OPENSSL [30]. The C
reference and optimized avx2 implementations of the Number-Theoretic Trans- 
form (NTT) from the post-quantum key encapsulation mechanism KYBER [10] 
are also evaluated. Among the total 54 benchmarks, 43 benchmarks contain fea- 
tures not supported by BVCRYPTOLINE such as signed variables. All experiments 
are performed on an Ubuntu 22.04.1 machine with a 3.20GHz Intel Xeon Gold 
6134M CPU and 1TB RAM. 

Benchmarks from security libraries are various field and group operations 
from elliptic curve cryptography (ECC). In ECC, rational points on curves are 
represented by elements in large finite fields. In BITCOIN, the finite field is the
residue system modulo the prime $p256k1 = 2^{256} - 2^{32} - 2^{9} - 2^{8} - 2^{7} - 2^{6} - 2^{4} - 1$.
For other security libraries (BORINGSSL, NSS, and OPENSSL), we verify the
operations in Curve25519 using the residue system modulo the prime
$p25519 = 2^{255} - 19$ as the underlying field. Rational points on elliptic curves form a group.
The group operation in turn is implemented by a number of field operations. 

In lattice-based post-quantum cryptosystems, polynomial rings are used.
Specifically, the polynomial ring $\mathbb{Z}_{3329}[X]/(X^{256} + 1)$ is used in KYBER. To
speed up multiplication in the polynomial ring, KYBER requires the multiplication
to be implemented by NTT. NTT is a discrete Fast Fourier Transform over
finite fields. Instead of complex roots of unity, NTT uses the principal roots of
unity in fields. Mathematically, the KYBER NTT computes the following ring
isomorphism


    $\mathbb{Z}_{3329}[X]/(X^{256} + 1) \cong \mathbb{Z}_{3329}[X]/(X^2 - \zeta_0) \times \cdots \times \mathbb{Z}_{3329}[X]/(X^2 - \zeta_{127})$


where the $\zeta_i$'s are the principal roots of unity.
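
The isomorphism rests on the fact that $X^{256} + 1$ factors into the 128 quadratic polynomials $X^2 - \zeta_i$ over $\mathbb{Z}_{3329}$. The Python sketch below checks this factorization numerically. It assumes that 17 is a primitive 256th root of unity modulo 3329 (the value used in the KYBER specification) and takes the $\zeta_i$ to be its odd powers; the order in which the factors are indexed is an assumption and does not affect the product.

```python
Q = 3329
ZETA = 17   # assumption: primitive 256th root of unity modulo Q

def poly_mul(f, g, q=Q):
    """Multiply polynomials given as coefficient lists (lowest degree first)."""
    h = [0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            h[i + j] = (h[i + j] + a * b) % q
    return h

def product_of_quadratics():
    """Multiply out (X^2 - ZETA^(2i+1)) for i = 0, ..., 127 modulo Q."""
    prod = [1]
    for i in range(128):
        z = pow(ZETA, 2 * i + 1, Q)
        prod = poly_mul(prod, [(-z) % Q, 0, 1])   # the polynomial X^2 - z
    return prod

factored = product_of_quadratics()
x256_plus_1 = [1] + [0] * 255 + [1]
```

Since the factors are pairwise coprime, the Chinese remainder theorem then yields the product decomposition stated above.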

We first compare COQCRYPTOLINE with all optimizations described in this 
paper against the unverified model checker CRYPTOLINE [16]. Both tools invoke 
the computer algebra system SINGULAR [19], but CRYPTOLINE neither lets SINGULAR
produce certificates nor certifies answers from SINGULAR. COQCRYPTOLINE
moreover uses the certified SMT QF_BV solver COQQFBV [33]; CRYPTOLINE
uses the uncertified but very efficient BOOLECTOR [28].

For the ECC experiments, COQCRYPTOLINE verifies all field operations in
6 minutes. It takes a few thousand seconds to verify group operations. The
most complex implementation (x25519_scalar_mult_generic) from BORINGSSL
(4274 statements) takes about 1.5 hours.² For KYBER, COQCRYPTOLINE verifies
in 2642 and 1048 seconds, respectively, that the reference and avx2 NTT
implementations indeed compute the isomorphism. The unverified CRYPTOLINE in
comparison finishes verification in about 95 seconds. A summary of the comparison
between COQCRYPTOLINE and CRYPTOLINE is shown in Fig. 3a. Though
COQCRYPTOLINE is much slower than CRYPTOLINE, the running time (1.5
hours) for the most complex implementation is still acceptable.


2 Two (out of three) modular polynomial equations in the post-condition are certified. 
Fig. 3. Running time (in seconds) comparisons


Figure 3b shows the percentages of average running time for COQCRYPTOLINE
internal OCAML code (INT), external SMT QF_BV solver (SMT), and
external computer algebra system (CAS). External solvers take much more time
than the internal OCAML program does. Between external solvers, the external
computer algebra system takes 4.63% of the time and the external SMT
QF_BV solver spends 93.28% of the time.

To show the performance of the lift optimization, we run COQCRYPTOLINE 
and BVCRYPTOLINE on root entailment problems generated from the bench- 
marks. Here we only consider 12 root entailment problems that trigger gbarith 
in BVCRYPTOLINE. Figure 3c shows the running time of SINGULAR in solving 
root entailment problems based on gbarith in BVCRYPTOLINE and lift in
COQCRYPTOLINE. BVCRYPTOLINE fails to solve 3 root entailment problems in
one hour. For the other 9 root entailment problems, lift outperforms gbarith. 

We also compare COQCRYPTOLINE with and without slicing. The version of
COQCRYPTOLINE without slicing is denoted by COQCRYPTOLINE⁻. The running
time comparison between COQCRYPTOLINE and COQCRYPTOLINE⁻ in
Fig. 3d shows that slicing reduces the running time noticeably.


5 Conclusion 


COQCRYPTOLINE is a verified model checker for cryptographic programs with
certified results. Its modules are formally verified in COQ with MATHCOMP.
COQCRYPTOLINE moreover employs external tools and validates their answers
with certificates. We evaluate COQCRYPTOLINE on benchmarks from industrial
security libraries (BITCOIN, BORINGSSL, NSS and OPENSSL) and a
post-quantum cryptography standard candidate (KYBER). In our experiments,
COQCRYPTOLINE verifies most cryptographic programs with certificates in a
reasonable time (7 min). Benchmarks with thousands of lines are verified in
1.7 h. To our knowledge, this is the first certified verification on operations of
the elliptic curve secp256k1 used in BITCOIN, and the avx2 and reference
implementations of KYBER number-theoretic transform.
Abstract. Identifying live and dead states in an abstract transition system
is a recurring problem in formal verification; for example, it arises in
our recent work on efficiently deciding regex constraints in SMT. How- 
ever, state-of-the-art graph algorithms for maintaining reachability infor- 
mation incrementally (that is, as states are visited and before the entire 
state space is explored) assume that new edges can be added from any 
state at any time, whereas in many applications, outgoing edges are 
added from each state as it is explored. To formalize the latter situa- 
tion, we propose guided incremental digraphs (GIDs), incremental graphs 
which support labeling closed states (states which will not receive further 
outgoing edges). Our main result is that dead state detection in GIDs 
is solvable in O(log m) amortized time per edge for m edges, improving
upon O(√m) per edge due to Bender, Fineman, Gilbert, and Tarjan
(BFGT) for general incremental directed graphs. 

We introduce two algorithms for GIDs: one establishing the logarith- 
mic time bound, and a second algorithm to explore a lazy heuristics- 
based approach. To enable an apples-to-apples experimental compari- 
son, we implemented both algorithms, two simpler baselines, and the 
state-of-the-art BFGT baseline using a common directed graph interface 
in Rust. Our evaluation shows 110-530x speedups over BFGT for the 
largest input graphs over a range of graph classes, random graphs, and 
graphs arising from regex benchmarks. 


Keywords: Dead State Detection · Graph Algorithms · Online
Algorithms · SMT


1 Introduction 


Classifying states in a transition system as live or dead is a recurring problem in 
formal verification. For example, given an expression, can it be simplified to the 
identity? Given an input to a nondeterministic program, can it reach a terminal 
state, or can it reach an infinitely looping state? Given a state in an automaton, 
can it reach an accepting state? State classification is relevant to satisfiability 
modulo theories (SMT) solvers [8,9,24,51], where theory-specific partial decision 
procedures often work by exploring the state space to find a reachable path that 
corresponds to a satisfying string or, more generally, a sequence of constructors.
To a first approximation, the core problem in all of these cases amounts to 
classifying each state u in a directed graph as live, meaning that a feasible, 
accepting, or satisfiable state is reachable from u; or dead, meaning that all 
states reachable from u are infeasible, rejecting, or unsatisfiable. 
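
For a graph that has already been fully explored, this classification is a plain backward reachability computation, sketched below in Python. This is only the exhaustive baseline for intuition, with hypothetical names; it is not one of the incremental algorithms discussed in this paper, which must answer the same question while edges are still being added.

```python
from collections import deque

def classify(states, edges, accepting):
    """Classify every state of a fully explored directed graph.

    A state is live if some accepting state is reachable from it
    (accepting states are live themselves); otherwise it is dead.
    """
    preds = {s: [] for s in states}
    for u, v in edges:
        preds[v].append(u)

    live = set(accepting)
    queue = deque(accepting)
    while queue:                      # backward BFS from the accepting states
        v = queue.popleft()
        for u in preds[v]:
            if u not in live:
                live.add(u)
                queue.append(u)

    dead = set(states) - live
    return live, dead

states = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (4, 4)]
live, dead = classify(states, edges, accepting=[3])
```

The pass costs time linear in the size of the graph, but it presupposes that the whole state space has been built, which is precisely what a lazy exploration tries to avoid.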


Motivating Applications. We originally encountered the problem of incremen- 
tal state classification during our prior work while building Z3’s regex solver [61] 
for the SMT theory of string and regex constraints [4,13,15]. Our solver lever- 
aged derivatives (in the sense of Brzozowski [18] and Antimirov [5]) to explore 
the states of the finite state machine corresponding to the regex incrementally 
(as the graph is built), to avoid the prohibitive cost of expanding all states ini- 
tially. This turns out to require solving the live and dead state detection problem 
in the finite state machine presented as an incremental directed graph.¹ Concretely,
consider the regex $(.^{*}\alpha.^{100})^{C} \cap (.\alpha)$, where . matches any character, $\cap$
is regex intersection, $^{C}$ is regex complement, and $\alpha$ matches any digit (0-9). A
traditional solver would expand the left and right operands as state machines,
but the left operand $(.^{*}\alpha.^{100})^{C}$ is astronomically large as a DFA, causing the
solver to hang. The derivative-based technique instead constructs the derivative
regex: $(.^{*}\alpha.^{100})^{C} \cap (.^{100})^{C} \cap \alpha$. At this stage we have a graph of two states and
one edge, where the states are the two regexes just described, and the edge is 
the derivative relation. After one more derivative operation, the regex is reduced 
to one that is clearly nonempty as it accepts the empty string. 
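The two derivative steps described above can be replayed with a small Brzozowski-derivative sketch. It assumes the example regex is (.*α.^{100})^C ∩ (.α); the tuple encoding of regexes and the helper names (`cat`, `alt`, `conj`, `neg`, `power`, `nullable`, `deriv`) are hypothetical choices for this illustration and are not taken from Z3's solver. A state is recognized as live as soon as a derivative is nullable, that is, accepts the empty string.

```python
import string

# Regex AST as nested tuples (an encoding assumed for this sketch only).
EMPTY, EPS, ANY, DIGIT = ("empty",), ("eps",), ("any",), ("digit",)

def cat(r, s):
    if EMPTY in (r, s):
        return EMPTY
    if r == EPS:
        return s
    if s == EPS:
        return r
    return ("cat", r, s)

def alt(r, s):
    if r == EMPTY:
        return s
    if s == EMPTY or r == s:
        return r
    return ("alt", r, s)

def conj(r, s):
    return EMPTY if EMPTY in (r, s) else ("and", r, s)

def neg(r):
    return ("not", r)

def star(r):
    return ("star", r)

def power(r, n):
    out = EPS
    for _ in range(n):
        out = cat(r, out)
    return out

def nullable(r):
    """True iff r accepts the empty string."""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "any", "digit"):
        return False
    if tag == "not":
        return not nullable(r[1])
    if tag == "alt":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])      # "cat" and "and"

def deriv(r, c):
    """Brzozowski derivative of r with respect to the character c."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "any":
        return EPS
    if tag == "digit":
        return EPS if c in string.digits else EMPTY
    if tag == "not":
        return neg(deriv(r[1], c))
    if tag == "alt":
        return alt(deriv(r[1], c), deriv(r[2], c))
    if tag == "and":
        return conj(deriv(r[1], c), deriv(r[2], c))
    if tag == "star":
        return cat(deriv(r[1], c), r)
    head = cat(deriv(r[1], c), r[2])              # "cat"
    return alt(head, deriv(r[2], c)) if nullable(r[1]) else head

P = power(ANY, 100)                               # .^{100}
X = cat(star(ANY), cat(DIGIT, P))                 # .* alpha .^{100}
r0 = conj(neg(X), cat(ANY, DIGIT))                # X^C  intersected with  (. alpha)
d1 = deriv(r0, "7")                               # first explored successor state
d2 = deriv(d1, "3")                               # second successor: accepts ""
```

Here `d1` is the complement of (X | .^{100}) intersected with α, which is the De Morgan form of the derivative regex quoted in the text, and `d2` is nullable, so all three explored states are live without ever building a DFA for the left operand.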

It is important that a derivative-based solver identify nonempty (live) and 
empty (dead) regexes incrementally because it does not generally construct the 
entire state space before terminating (see the graph update rule UPD, p. 626 [61]). 
Moreover, the nonemptiness problem for extended regexes is non-elementary [62] 
— and still PSPACE-complete for more restricted fragments — which strongly 
favors a lazy approach over brute-force search. 

Regexes are just one possible application; the algorithms we will present here 
are broadly applicable to any context where the states have a bounded (per- 
node) out-degree. For example, they could be applied in LTL model checking 
when lazily exploring the state space of a nondeterministic Büchi automaton
(NBA), where the NBA is too expensive to construct up front. The important 
fact is that each state of the automaton has only finitely many outgoing edges, 
and when all these are added, we can hope to check for dead states incrementally. 


Prior Work. Traditionally, while live state detection can be done incremen- 
tally, dead state detection is often done exhaustively (i.e., after the entire state 
space is explored). For example, bounded and finite-state model checkers based 
on translations to automata [20,43,58], as well as classical dead-state elimina- 
tion algorithms [12,16,37], typically work on a fixed state space after it has 
been fully enumerated. However, we reiterate that exhaustive exploration is prohibitive
for large (e.g., exponential or infinite) state spaces which arise in an SMT
verification context. We also have good evidence that incremental feedback
can improve SMT solver performance: a representative success story is the e-graph
data structure [23,67], which maintains an equivalence relation among
expressions incrementally; because it applies to general expressions, it is
theory-independent and re-usable. Incremental state space exploration could lead to
similar benefits if applied to SMT procedures which still rely on exhaustive
search.

1 The specific setting is regexes with intersection and complement (extended [31,44]
or generalized [26] regexes), which are found natively in security applications [6,61].
Other solvers have also leveraged derivatives [45] and laziness in general [36].

Fig. 1. GID consisting of the sequence of updates E(1, 2), E(1, 3), T(2). Terminal states
are drawn as double circles. After the update T(2), states 1 and 2 are known to be live.
State 3 is not dead in this GID, as a future update may cause it to be live.

Fig. 2. GID extending Fig. 1 with additional updates E(4, 3), E(4, 5), C(4), C(5). Closed
states are drawn as solid circles. After the update C(5) (but not earlier), state 5 is dead.
State 4 is not dead because it can still reach state 3.

However, in order to perform incremental dead state detection, we currently
lack algorithms which match offline performance. As we discuss in Sect. 2,
the best-known existing solutions would require maintaining strongly connected
components (SCCs) incrementally. For SCC maintenance and the related simpler
problem of cycle detection, amortized algorithms are known with $O(m^{3/2})$
total time for m edge additions [10,33], with some recently announced improvements
[11,14]. Note that this is in sharp contrast to O(m) for the offline variants
of these problems, which can be solved by breadth-first or depth-first search.
More generally, research suggests there are computational barriers to solving 
unconstrained reachability problems in incremental and dynamic graphs [1,29]. 


This Paper. To improve on prior algorithms, our key observation is that in 
many applications (including our motivating applications above), edges are not 
added adversarially, but from one state at a time as the states are explored. As 
a result, we know when a state will have no further outgoing edges. This enables 
us to (i) identify dead states incrementally, rather than only after the whole 
state space is explored; and (ii) obtain more efficient algorithms than currently 
exist for general graph reachability. 

We introduce guided incremental digraphs (GIDs), a variation on incremental 
graphs. Like an incremental directed graph, a guided incremental digraph may be 
updated by adding new edges between states, or a state may be labeled as closed, 
meaning it will receive no further outgoing edges. Some states are designated as
terminal, and we say that a state is live if it can reach a terminal state and
dead if it will never reach a terminal state in any extension — i.e. if all reachable 
states from it are closed (see Figs. 1 and 2). To our knowledge, the problem of 
detecting dead states in such a system has not been studied by existing work in 
graph algorithms. Our problem can be solved through solving SCC maintenance, 
but not necessarily the other way around (Sect. 2, Proposition 1). We provide 
two new algorithms for dead-state detection in GIDs. 

First, we show that the dead-state detection problem for GIDs can be solved
in time O(m log m) for m edge additions, within a logarithmic factor of the
O(m) cost for offline search. The worst-case performance of our algorithm thus
strictly improves on the $O(m^{3/2})$ upper bound for SCC maintenance in general
incremental graphs. Our algorithm is technically sophisticated, and utilizes
several data structures and existing results in online algorithms: in particular, 
Union-Find [63] and Henzinger and King’s Euler Tour Trees [35]. The main idea 
is that, rather than explicitly computing the set of SCCs, for closed states we 
maintain a single path to a non-closed (open) state. This turns out to reduce the 
problem to quickly determining whether two states are currently assigned a path 
to the same open state. On the other hand, Euler Tour Trees can solve undirected 
reachability for graphs that are forests in logarithmic time. The challenge then 
lies in figuring out how to reduce directed connectivity in the graph of paths to 
an undirected forest connectivity problem. At the same time, we must maintain 
this reduction under Union-Find state merges, in order to deal with cycles that 
are found in the graph along the way. 

While as theorists we would like to believe that asymptotic complexity is
enough, the truth is that the use of complex data structures (1) can be prohibitively
expensive in practice due to constant-factor overheads, and (2) can
make algorithms substantially more difficult to implement, leading practitioners
to prefer simpler approaches. To address these needs, in addition to the
logarithmic-time algorithm, we provide a second lazy algorithm which avoids
the use of Euler Tour Trees, and only uses union-find. This algorithm is based
on an optimization of adding shortcut jump edges for long paths in the graph to
quickly determine reachability. This approach aims to perform well in practice
on typical graphs, and is included in our evaluation alongside the
logarithmic-time algorithm, though we do not prove its asymptotic complexity.

Finally, we implement and empirically evaluate both of our algorithms for 
GIDs against several baselines in 5.5k lines of code in Rust [47]. Our evaluation 
focuses on the performance of the GID data structure itself, rather than its end- 
to-end performance in applications. To ensure an apples-to-apples comparison 
with existing approaches, we put particular focus on providing a directed graph 
data structure backend shared by all algorithms, so that the cost of graph search 
as well as state and edge merges is identical across algorithms. We implement 
two naive baselines, as well as an implementation of the state-of-the-art solution
based on maintaining SCCs, BFGT [10], in our framework. To our knowledge,
the latter is the first implementation of BFGT specifically for SCC maintenance.
On a collection of generated benchmark GIDs, random GIDs, and GIDs directly
pulled from the regex application, we demonstrate a substantial improvement
over BFGT for both of our algorithms. For example, for larger GIDs (those with
over 100K updates), we observe a 110-530x speedup over BFGT.

2 Reachability in dynamic forests can also be solved by Sleator-Tarjan trees [59],
Frederickson’s Topology Trees [30], or Top Trees [3]. Of these, we found Euler Tour
Trees the easiest to work with in our implementation. See also [64].


Contributions. Our primary contributions are: 


— Guided incremental digraphs (GIDs), a formalization of incremental live and 
dead state detection which supports labeling closed states. (Section 2) 

— Two algorithms for the state classification problem in GIDs: first, an algo- 
rithm that works in amortized O(log m) time per update, improving upon 
the state-of-the-art amortized $O(\sqrt{m})$ per update for incremental graphs;
and second, a simpler algorithm based on lazy heuristics. (Section 3) 

— An open-source implementation³ of GIDs in Rust, and an evaluation which
demonstrates up to two orders of magnitude speedup over BFGT. (Section 4) 


Following the above, we expand on the application of GIDs to regex solving 
in SMT (Sect. 5) and survey related work (Sect. 6). 


2 Guided Incremental Digraphs 


2.1 Problem Statement 


An incremental digraph is a sequence of edge updates E(u, v), where the algo- 
rithmic challenge in this context is to produce some output after each edge is 
received (e.g., whether or not a cycle exists). If the graph also contains updates 
T(u) labeling a state as terminal, then we say that a state is live if it can reach 
a terminal state in the current graph. In a guided incremental digraph, we also 
include updates C(u) labeling a state as closed, meaning that it will not receive
any further outgoing edges. 


Definition 1. Define a guided incremental digraph (GID) to be a sequence of 
updates, where each update is one of the following: 


(i) a new directed edge E(u, v); 
(ii) a label T(u) which indicates that u is terminal; or
(iii) a label C(u) which indicates that u is closed, i.e. no further edges will be 
added going out from u (or labels to u). 


The GID is valid if the closed labels are correct: there are no instances of 
E(u, v) or T(u) after an update C(u). The denotation of G is the directed graph 
(V, E) where V is the set of all states u which have occurred in any update in 
the sequence, and E is the set of all (u,v) such that E(u,v) occurs in G. An
extension of a valid GID G is a valid GID G’ such that G is a prefix of G’. 


3 https://github.com/cdstanford/gid.

In a valid GID G, we say that a state u is live if there is a path from u to a
terminal state in the denotation of G; and a state u is dead if it is not live in 
any extension of G. Notice that in a GID without any C(u) updates, no states 
are dead as an edge may be added in an extension which makes them live. 

We provide an example of a valid GID in Figs. 1 and 2 consisting of the
following sequence of updates: E(1, 2), E(1, 3), T(2), E(4, 3), E(4, 5), C(4), C(5).
Terminal states T(u) are drawn as double circles; closed states C(u), as solid
circles; and states that are not closed, as dashed circles.


Definition 2. Given as input a valid GID, the GID state classification problem 
is to output, in an online fashion after each update, the set of new live and new 
dead states. That is, output Live(u) or Dead(u) on the smallest prefix of updates 
such that u is live or dead on that prefix, respectively. 
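As a reference point for Definition 2, the following brute-force sketch recomputes the live and dead sets from scratch on every prefix. It is intended only as a specification-level oracle, not as one of the paper's algorithms; the tuple encoding of updates and the function names `classify` and `online_outputs` are assumptions made for this illustration.

```python
from collections import defaultdict

def classify(updates):
    """Return (live, dead) for a GID given as ("E", u, v), ("T", u), ("C", u) tuples."""
    succs = defaultdict(set)
    states, terminal, closed = set(), set(), set()
    for upd in updates:
        if upd[0] == "E":
            _, u, v = upd
            succs[u].add(v)
            states.update((u, v))
        elif upd[0] == "T":
            terminal.add(upd[1])
            states.add(upd[1])
        else:
            closed.add(upd[1])
            states.add(upd[1])

    def reachable(u):
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            for y in succs[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return seen

    # Live: some reachable state is terminal.
    live = {u for u in states if reachable(u) & terminal}
    # Dead: every reachable state (including u) is closed and none is terminal.
    dead = {u for u in states if u not in live and reachable(u) <= closed}
    return live, dead

def online_outputs(updates):
    """For each prefix, list the states that first become live or dead on it."""
    seen, out = set(), []
    for i in range(1, len(updates) + 1):
        live, dead = classify(updates[:i])
        new = [("Live", u) for u in sorted(live - seen)]
        new += [("Dead", u) for u in sorted(dead - seen)]
        seen |= live | dead
        out.append(new)
    return out

FIG_1_2 = [("E", 1, 2), ("E", 1, 3), ("T", 2),
           ("E", 4, 3), ("E", 4, 5), ("C", 4), ("C", 5)]
```

On the GID of Figs. 1 and 2, `online_outputs(FIG_1_2)` reports states 1 and 2 as live at the update T(2) and state 5 as dead at the update C(5); states 3 and 4 are never reported.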


2.2 Existing Approaches 


In many applications, one might choose to classify dead states offline, after the 
entire state space is enumerated. This leads to a linear-time algorithm via either 
DFS or BFS, but it does not solve our problem (Definition 2) because it is 
not incremental. Naive application of this idea leads to O(m) per update for m
updates ($O(m^2)$ total), as we may redo the entire search after each update.

For acyclic graphs, there exists an amortized O(1)-time per update algorithm 
for the problem (Definition 2): maintain the graph as a list of forward- and 
backward-edges at each state. When a state v is marked terminal, do a DFS 
along backward-edges to determine all states u that can reach v not already 
marked as live, and mark them live. When a state v is marked closed, visit 
all forward-edges from v; if all are dead, mark v as dead and recurse along all 
backward-edges from v. As each edge is visited only when marking a state live 
or dead, it is only visited a constant number of times overall (though we may 
use more than O(1) time on some particular update pass). Additionally, the live 
state detection part of this procedure still works for graphs containing cycles. 
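The acyclic procedure above can be sketched as follows. This is an illustrative sketch under stated assumptions: the class name `AcyclicGID` and its fields are hypothetical, and instead of rescanning the forward edges of a closed state it keeps a per-state counter of out-edges whose targets are not yet dead, which is one simple way to keep the work per edge constant. It is only correct when the input graph has no cycles.

```python
from collections import defaultdict

class AcyclicGID:
    def __init__(self):
        self.bck = defaultdict(list)        # v -> sources of edges into v
        self.alive_out = defaultdict(int)   # u -> number of out-edges to non-dead targets
        self.live, self.dead, self.closed = set(), set(), set()

    def _mark_live(self, v):
        """Mark v and everything that reaches v (via backward edges) as live."""
        out, stack = [], [v]
        while stack:
            x = stack.pop()
            if x in self.live:
                continue
            self.live.add(x)
            out.append(("Live", x))
            stack.extend(self.bck[x])
        return out

    def _try_dead(self, v):
        """Mark v dead if it is closed with no non-dead successor, then propagate."""
        out, stack = [], [v]
        while stack:
            x = stack.pop()
            if x in self.live or x in self.dead or x not in self.closed:
                continue
            if self.alive_out[x] > 0:
                continue
            self.dead.add(x)
            out.append(("Dead", x))
            for u in self.bck[x]:
                self.alive_out[u] -= 1
                stack.append(u)
        return out

    def on_edge(self, u, v):
        self.bck[v].append(u)
        if v not in self.dead:
            self.alive_out[u] += 1
        return self._mark_live(u) if v in self.live else []

    def on_terminal(self, v):
        return self._mark_live(v)

    def on_closed(self, v):
        self.closed.add(v)
        return self._try_dead(v)
```

Each method returns the newly classified states for that update. On a graph with a cycle of closed states, every state on the cycle keeps a positive counter, so none is ever reported dead; this is the gap that the rest of the section addresses.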

The challenge, therefore, lies primarily in detecting dead states in graphs 
which may contain cycles. For this, the breakthrough approach from [10] main- 
tains a condensed graph which is acyclic, where the vertices in the condensed 
graph represent strongly connected components (SCCs) of states. The mapping 
from states to SCCs is maintained using a Union-Find [63] data structure. Maintaining
the condensed graph requires $O(\sqrt{m})$ time per update. To avoid confusing
closed and non-closed states, we also have to make sure that they are not
merged into the same SCC; the easiest solution to this is to withhold all edges
from each state u in the graph until u is closed, which ensures that u must be in
an SCC on its own. Once we have the condensed graph with these modifications,
the same algorithm as in the previous paragraph works to identify live and dead
states. Since each edge is only visited when a state is marked closed or live, each
edge is visited only once throughout the algorithm, so we use only amortized O(1)
additional time to calculate live and dead states. While this SCC maintenance
algorithm ignores the fact that edges do not occur from closed states C(u), this 
still proves the following result:

Live        Some reachable state from u is terminal.
Dead        All reachable states from u (including u) are closed and not terminal.
Unknown     u is closed, but not live or dead.
Open        u is not live and not closed.

Terminal    A state u labeled by T(u).
Closed      A state u labeled by C(u).
Canonical   A state x such that UF.find(x) = x.
u, v, w     States (may or may not be canonical).
x, y, z     Canonical states (i.e., states in the condensed graph).
Successor   For an unknown, canonical state x, a uniquely chosen v such that (x, v)
succ(x)     is an edge, and following the path of successors leads to an open state.

Fig. 3. Top: Basic classification of GID states into four disjoint categories. Bottom:
Additional terminology used in this paper.


Proposition 1. GID state classification reduces to SCC maintenance. That is, 
suppose we have an algorithm over incremental graphs that maintains the set of 
SCCs in O(f(m,n)) total time given n states and m edge additions. Then there 
exists an algorithm to solve GID state classification in O(f(m,n)) total time. 


Despite this reduction one way, there is no obvious reduction the other way — 
from cycle detection or SCCs to Definition 2. This is because, while the existence 
of a cycle of non-live states implies bi-reachability between all states in the cycle, 
it does not necessarily imply that all of the bi-reachable states are dead. 


3 Algorithms 


This section presents Algorithm 2, which solves the state classification problem 
in logarithmic time (Theorem 3); and Algorithm 3, an alternative lazy solution. 
Both algorithms are optimized versions of Algorithm 1, a first-cut algorithm 
which establishes the structure of our approach. We begin by establishing some 
basic terminology shared by all of the algorithms (see Fig. 3). 

States in a GID can be usefully classified as exactly one of four statuses: 
live, dead, unknown, or open, where unknown means “closed but not yet live or 
dead”, and open means “not closed and not live”. Note that a state may be live 
and neither open nor closed; this terminology keeps the classification disjoint. 
Pragmatically, for live states it does not matter if they are classified as open or 
closed, since edges from those states no longer have any effect. However, all dead 
and unknown states are closed, and no states are both open and closed. 

Given this classification, the intuition is that for each unknown state u, we 
only need one path from u to an open state to prove that it is not dead; we want 
to maintain one such path for all unknown states. To maintain all of these paths
simultaneously, we maintain an acyclic directed forest structure on unknown and
open states where the roots are open states, and all non-root states have a single
edge to another state, called its successor. Edges other than successor edges can
be temporarily ignored, except for when marking live states; these are kept as
reserve edges. Specifically, we add every edge (u,v) as a backward-edge from v
(to allow propagating live states), but for edges not in the forest we keep (u, v)
in a reserve list from u. We store all edges, including backward-edges, in the
original order (u,v). The reserve list edge becomes relevant only when either (i)
u is marked as closed, or (ii) u's successor is marked as dead.

4 To be precise, “maintains” means that (i) we can check whether two states are in
the same SCC in O(1) time; and (ii) we can iterate over all the states, edges from,
or edges into a SCC in O(1) time per state or edge.

In order to deal with cycles, we need to maintain the forest of unknown states 
not on the original graph, but on a union-find condensed graph, similar to [63]. 
When we find a cycle of unknown states, we merge all states in the cycle by 
calling the union method in the union-find. We refer to a state as canonical if 
it is the canonical representative of its equivalence class in the union find; the 
condensed graph is a forest on canonical states. We use x, y, z to denote canonical 
states (states in the condensed graph), and u,v, w to denote the original states 
(not known to be canonical). Following [63], we maintain edges as linked lists 
rather than sets, and using the original states instead of canonical states; this is 
important as it allows combining edge lists in O(1) time when merging states. 


3.1 First-Cut Algorithm 


Algorithm 1 is a first cut based on these ideas. The procedures ONEDGE and 
ONTERMINAL contain all the logic to identify live states, using bck to look up 
backward-edges; ONTERMINAL doubles as a “mark live” function when it is 
called by ONEDGE. The procedure ONCLOSED tries to assign a successor edge 
to a newly closed state, to prove that it is not dead. In case we run out of 
reserve edges, the state is marked dead and we recursively call ONCLOSED along 
backward-edges, which will either set a new successor or mark them dead. 

The union-find data structure UF provides UF.union(v1, v2), UF.find(v), and
UF.iter(v): UF.union merges v1 and v2 to refer to the same canonical state,
UF.find returns the canonical state for v, and UF.iter iterates over states equivalent
to v. These use amortized α(n) time for n updates, where α(n) ∈ o(log n) is the
inverse Ackermann function. We only merge states if they are bi-reachable from
each other, and both unknown; this implies that all states equivalent to a state 
x have the same status. Each edge (u,v) is always stored in the maps res and 
bck using its original states (i.e., edge labels are not updated when states are 
merged); but we can quickly obtain the corresponding edge on canonical states 
via (UF.find(u),UF.find(v)). Once a state is marked Live or Dead, its edge 
maps are no longer used. 
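The UF interface described above (union, find, and iter) can be sketched as below. The class is an illustration rather than the authors' Rust implementation; in particular, supporting `iter` by threading each equivalence class on a circular `next` list that is spliced in O(1) on union is an assumption of this sketch.

```python
class UnionFind:
    def __init__(self):
        self.parent, self.rank, self.next = {}, {}, {}

    def _add(self, v):
        if v not in self.parent:
            self.parent[v] = v
            self.rank[v] = 0
            self.next[v] = v            # a singleton circular list

    def find(self, v):
        """Return the canonical state for v, compressing the path on the way."""
        self._add(v)
        root = v
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[v] != root:
            nxt = self.parent[v]
            self.parent[v] = root
            v = nxt
        return root

    def union(self, a, b):
        """Merge the classes of a and b; return the new canonical state."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        # Swapping two next-pointers splices the two circular lists into one.
        self.next[ra], self.next[rb] = self.next[rb], self.next[ra]
        return ra

    def iter(self, v):
        """Yield every state equivalent to v, starting with v itself."""
        self._add(v)
        yield v
        x = self.next[v]
        while x != v:
            yield x
            x = self.next[x]
```

Because edges are stored under their original endpoints, a lookup such as `(uf.find(u), uf.find(v))` recovers the corresponding edge on canonical states without relabeling any stored edge after a merge.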


Invariants. Altogether, we respect the following invariants. Successor and no 
cycles describe the forest structure, and edge representation ensures that all
edges in the input GID are represented somehow in the current graph. 


— Merge equivalence: For all states u and v, if UF.find(u) = UF.find(v), then 
u and v are bi-reachable and both closed. (This implies that u and v are both 
live, both dead, or both unknown.)

— Status correctness: For all u, status(UF.find(u)) equals the status of u.

— Successor edges: If x is unknown, then succ(x) is defined and is an unknown
or open state. If x is open, then succ(x) is not defined.

— No cycles: There are no cycles among the set of edges (x, UF.find(succ(x))),
over all unknown and open canonical states x.

— Edge representation: For all edges (u,v) in the input GID, at least one of the
following holds: (i) (u,v) ∈ res(UF.find(u)); (ii) v = succ(UF.find(u)); (iii)
UF.find(u) = UF.find(v); (iv) u is live; or (v) v is dead.


Algorithm 1. First-cut algorithm.

  V: a type for states (integers) (variables u, v, ...)
  E: the type of edges, equal to (V, V)
  UF: a union-find data structure over V
  X: the set of canonical states in UF (variables x, y, z, ...)
  status: a map from X to Live, Dead, Unknown, or Open
  succ: a map from X to V
  res and bck: maps from X to linked lists of E

  procedure ONEDGE(E(u, v))
      x ← UF.find(u); y ← UF.find(v)
      if status(y) = Live then
          ONTERMINAL(T(x))                  ▷ mark x and its ancestors live
      else if status(x) ≠ Live then         ▷ status(x) must be Open
          append (u, v) to res(x)
          append (u, v) to bck(y)

  procedure ONTERMINAL(T(v))
      y ← UF.find(v)
      for all x in DFS backwards (along bck) from y not already Live do
          status(x) ← Live
          output Live(x′) for all x′ in UF.iter(x)

  procedure ONCLOSED(C(v))
      y ← UF.find(v)
      if status(y) ≠ Open then return       ▷ y is already live or closed
      while res(y) is nonempty do
          pop (v, w) from res(y); z ← UF.find(w)
          if status(z) = Dead then continue
          else if CHECKCYCLE(y, z) then
              for all z′ in cycle from z to y do z ← MERGE(z, z′)
          else
              status(y) ← Unknown; succ(y) ← z
              return
      status(y) ← Dead; output Dead(y′) for all y′ in UF.iter(y)
      ToRecurse ← ∅
      for all (u, v) in bck(y) do
          x ← UF.find(u)
          if status(x) = Unknown and UF.find(succ(x)) = y then
              status(x) ← Open              ▷ temporary; marked closed on recursive call
              add x to ToRecurse
      for all x in ToRecurse do ONCLOSED(C(x))

  procedure CHECKCYCLE(y, z) returning bool
      while status(z) = Unknown do z ← UF.find(succ(z))    ▷ get root state from z
      return y = z

  procedure MERGE(x, y) returning V
      z ← UF.union(x, y)
      bck(z) ← bck(x) + bck(y)              ▷ O(1) linked list append
      res(z) ← res(x) + res(y)              ▷ O(1) linked list append
      return z


Theorem 1. Algorithm 1 is correct. 


Proof (Summary). The full proof can be found in the arXiv version [60]. The 
status correctness invariant implies correct output at each step, so it suffices to 
argue that all of the invariants above are preserved. Upon receiving E(u, v) or 
T(u), some dead, unknown, or open states may become live, but this does not 
change the status of any other states. The main challenge of the proof is the
recursive procedure ONCLOSED(C(u)). On recursive calls, some states are temporarily
marked Open, meaning they are roots in the forest structure. During
recursive calls, we need a slightly generalized invariant: each forest root corresponds
to a pending call to ONCLOSED(C(u)) (i.e., an element of ToRecurse for
some call on the stack) and is a state that is dead iff all of its reserve edges are
dead. After we prove this (generalized) invariant, when ONCLOSED(C(u)) terminates,
we know that there are no more temporary open states, and the forest
structure implies that all closed states are correctly marked as unknown. 
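To make the structure of Algorithm 1 concrete, here is a compact executable sketch. It follows the pseudocode's procedures but makes several simplifying assumptions that are not from the paper: the class name `FirstCutGID` is hypothetical, the union-find is inlined without union by rank, Python list concatenation stands in for the O(1) linked-list append, the cycle states are merged into the state being closed (which stays canonical and open), and recursion is used directly, so very deep graphs could exceed Python's recursion limit.

```python
from collections import defaultdict

LIVE, DEAD, UNKNOWN, OPEN = "Live", "Dead", "Unknown", "Open"

class FirstCutGID:
    def __init__(self):
        self.parent = {}                          # union-find parent pointers
        self.members = {}                         # canonical state -> merged states
        self.status = defaultdict(lambda: OPEN)   # keyed by canonical state
        self.succ = {}                            # canonical state -> successor state
        self.res = defaultdict(list)              # reserve edges by canonical source
        self.bck = defaultdict(list)              # backward edges by canonical target
        self.out = []                             # emitted (status, state) events

    def find(self, v):
        if v not in self.parent:
            self.parent[v], self.members[v] = v, [v]
        while self.parent[v] != v:
            self.parent[v] = self.parent[self.parent[v]]   # path halving
            v = self.parent[v]
        return v

    def merge(self, x, y):
        """Merge canonical y into canonical x; x remains canonical."""
        self.parent[y] = x
        self.members[x] += self.members.pop(y)
        self.bck[x] += self.bck.pop(y, [])
        self.res[x] += self.res.pop(y, [])
        return x

    def on_edge(self, u, v):
        x, y = self.find(u), self.find(v)
        if self.status[y] == LIVE:
            self.on_terminal(u)                   # mark u and its ancestors live
        elif self.status[x] != LIVE:
            self.res[x].append((u, v))
            self.bck[y].append((u, v))

    def on_terminal(self, v):
        stack = [self.find(v)]
        while stack:                              # DFS along backward edges
            x = stack.pop()
            if self.status[x] == LIVE:
                continue
            self.status[x] = LIVE
            self.out += [(LIVE, s) for s in self.members[x]]
            stack += [self.find(u) for u, _ in self.bck[x]]

    def on_closed(self, v):
        y = self.find(v)
        if self.status[y] != OPEN:
            return
        while self.res[y]:
            _, w = self.res[y].pop()
            z = self.find(w)
            if self.status[z] == DEAD:
                continue
            path, root = [], z                    # CHECKCYCLE: walk successors to the root
            while self.status[root] == UNKNOWN:
                path.append(root)
                root = self.find(self.succ[root])
            if root == y:                         # the edge closes a cycle: merge it
                for p in path:
                    del self.succ[p]
                    self.merge(y, p)
            else:                                 # found a path to another open state
                self.status[y] = UNKNOWN
                self.succ[y] = w
                return
        self.status[y] = DEAD
        self.out += [(DEAD, s) for s in self.members[y]]
        to_recurse = []
        for u, _ in self.bck[y]:
            x = self.find(u)
            if self.status[x] == UNKNOWN and self.find(self.succ[x]) == y:
                self.status[x] = OPEN             # temporary; re-closed just below
                del self.succ[x]
                to_recurse.append(x)
        for x in to_recurse:
            self.on_closed(x)
```

Replaying the updates of Figs. 1 and 2 through `on_edge`, `on_terminal`, and `on_closed` leaves `out` holding Live(2), Live(1), and Dead(5). The successor walk inside `on_closed` is the part whose cost the following sections reduce.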


Complexity. The core inefficiency in Algorithm 1 (what we need to improve)
lies in CHECKCYCLE. The procedure repeatedly sets z ← succ(z) to find
the tree root, which in general could be linear time in the number of edges.
For example, this inefficiency results in $O(m^2)$ work for a linear graph read in
backwards order: E(2, 1), C(2), E(3, 2), C(3), ..., E(n, n-1), C(n). 
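The quadratic behaviour on this chain can be checked by counting successor hops directly. The sketch below simulates only the successor walk of CHECKCYCLE on the backwards chain; the function name `chain_hops` is hypothetical, and the simulation relies on the observation that in this input the unknown states are exactly the closed ones.

```python
def chain_hops(n):
    """Count successor hops made by the naive CHECKCYCLE on E(2,1), C(2), ..., E(n,n-1), C(n)."""
    succ, closed = {}, set()
    hops = 0
    for k in range(2, n + 1):
        # Closing k examines its only reserve edge (k, k-1) and walks from k-1 to the root.
        z = k - 1
        while z in closed:
            z = succ[z]
            hops += 1
        # The root is the open state 1, which differs from k, so there is no cycle:
        # the edge (k, k-1) becomes the successor edge of k.
        succ[k] = k - 1
        closed.add(k)
    return hops
```

Closing state k costs k - 2 hops, so the total is (n - 1)(n - 2) / 2 for a chain with m = n - 1 edges, which grows quadratically in m.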

All other procedures use amortized α(m) time per update for m updates,
using array lists to represent the maps res, bck, and succ for O(1) lookups. To
do the amortized analysis, the cost of each call to ONCLOSED can be assigned 
either to the target of an edge being marked dead, or to an edge being merged 
as part of a cycle, and both of these events can only happen once per edge added 
to the GID. And the ONTERMINAL calls and loop iterations only run once per 
edge in the graph when the target of that edge is marked live or terminal. 


3.2 Logarithmic Algorithm 


At its core, CHECKCYCLE requires solving an undirected reachability problem on 
a graph that is restricted to a forest. However, the forest is changed not just by
edge additions, but also by edge deletions. While undirected reachability
and reachability in directed graphs are both difficult to solve incrementally,
reachability in dynamic forests can be solved in O(log m) time per operation.
This is the main intuition for our solution, using an Euler Tour Trees data 
structure EF of Henzinger and King [35], shown in Algorithm 2. 
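To clarify the interface that the algorithm expects from EF (add, remove, and connected on an undirected forest), the following is a deliberately naive stand-in. It is not an Euler Tour Tree: `connected` takes time linear in the size of the tree rather than O(log m), and the class name `NaiveForest` is hypothetical. It only illustrates the contract that a real dynamic-forest structure would implement.

```python
from collections import defaultdict

class NaiveForest:
    def __init__(self):
        self.adj = defaultdict(set)     # undirected adjacency sets

    def connected(self, u, v):
        """True iff u and v lie in the same tree of the forest."""
        seen, stack = {u}, [u]
        while stack:
            x = stack.pop()
            if x == v:
                return True
            for y in self.adj[x]:
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        return False

    def add(self, u, v):
        """Add the undirected edge {u, v}; the result must remain a forest."""
        assert not self.connected(u, v), "edge would close an undirected cycle"
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove(self, u, v):
        """Remove the undirected edge {u, v} if present."""
        self.adj[u].discard(v)
        self.adj[v].discard(u)

# The chain 3 -> 2 -> 1 of successor edges, stored as undirected forest edges.
ef = NaiveForest()
ef.add(2, 1)
ef.add(3, 2)
```

With this contract, the successor walk from a state to its root is replaced by a single query such as `ef.connected(3, 1)`, and removing a successor edge (for example when its target dies) is a single call to `remove`.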
Algorithm 2. Logarithmic time algorithm.

  All data from Algorithm 1; succ: a map from X to E (instead of to V)
  EF: Euler Tour Trees data structure providing: EF.add, EF.remove, EF.connected

  procedure ONEDGE, MERGE as in Algorithm 1

  procedure ONTERMINAL(T(v))
      y ← UF.find(v)
      for all x in DFS backwards (along bck) from y not already Live do
          if status(x) = Unknown then
              ▷ The following line is not strictly necessary, but simplifies the analysis
              (u, v) ← succ(x); delete succ(x); EF.remove(u, v)
          status(x) ← Live; output Live(x′) for all x′ in UF.iter(x)

  procedure ONCLOSED(C(v))
      y ← UF.find(v)
      if status(y) ≠ Open then return
      while res(y) is nonempty do
          pop (v, w) from res(y); z ← UF.find(w)
          if status(z) = Dead then continue
          else if CHECKCYCLE(y, z) then
              for all z′ in cycle from z to y do z ← MERGE(z, z′)
          else
              status(y) ← Unknown; succ(y) ← (v, w)
              EF.add(v, w); return          ▷ undirected edge; use original labels (not (y, z))
      status(y) ← Dead; ToRec ← ∅; output Dead(y′) for all y′ in UF.iter(y)
      for all (u, v) in bck(y) do
          x ← UF.find(u)
          if status(x) = Unknown then
              (u′, v′) ← succ(x)
              if UF.find(v′) = y then
                  EF.remove(u′, v′); status(x) ← Open; delete succ(x); add x to ToRec
      for all x in ToRec do ONCLOSED(C(x))

  procedure CHECKCYCLE(y, z) returning bool
      return EF.connected(y, z)

Unfortunately, this idea does not work straightforwardly — once again because 
of the presence of cycles in the original graph. We cannot simply store the forest 
as a condensed graph with edges on condensed states. As we saw in Algorithm 
1, it was important to store successor edges as edges into V, rather than edges 
into X — this is the only way that we can merge states in O(1), without actually 
inspecting the edge lists. If we needed to update the forest edges to be in X, this 
could require O(m) work to merge two O(m)-sized edge lists as each edge might 
need to be relabeled in the EF graph. 

To solve this challenge, we instead store the EF data structure on the original 
states, rather than the condensed graph; but we ensure that each canonical state 
is represented by a tree of original states. When adding edges between canonical 
states, we need to make sure to remember the original label (u, v), so that we can 
later remove it using the original labels (this happens when its target becomes
dead). When an edge would create a cycle, we instead simply ignore it in the EF
graph, because a line of connected trees forms a tree.


Summary and Invariants. In summary, the algorithm reuses the data, proce- 
dures, and invariants from Algorithm 1, with the following important changes: 
(1) We maintain the EF data structure EF, a forest on V. (2) The successor edges 
are stored as their original edge labels (u,v), rather than just as a target state. 
(3) The procedure ONCLOSED is rewritten to maintain the graph EF. (4) The 
successor edges and no cycles invariants use the new succ representation: that 
is, they are constraints on the edges (x, UF.find(v)), where succ(x) = (u,v).
(5) We add the following two constraints on edges in EF, depending on whether 
those states are equivalent in the union-find structure. 


— EF inter-edges: For all inequivalent u,v, (u,v) is in the EF if and only if 
(u,v) = succ(UF.find(u)) or (v,u) = succ(UF.find(v)).

— EF intra-edges: For all unknown canonical states x, the set of edges (u, v) in 
the EF between states belonging to x forms a tree. 


Theorem 2. Algorithm 2 is correct. 


Proof. Observe that the EF inter-edges constraint implies that EF only contains 
edges between unknown and open states, together with isolated trees. In the 
modified ONTERMINAL procedure, when marking states as live we remove inter- 
edges, so we preserve this invariant. 

Next we argue that given the invariants about EF, for an open state y the 
CHECKCYCLE procedure returns true if and only if (y, z) would create a directed 
cycle. If there is a cycle of canonical states, then because canonical states are 
connected trees in EF, the cycle can be lifted to a cycle on original states, so y and 
z must already be connected in this cycle without the edge (y, z). Conversely, if 
y and z are connected in EF, then there is a path from y to z, and this can be 
projected to a path on canonical states. However, because y is open, it is a root 
in the successor forest, so any path from y along successor edges travels only 
on backward-edges; hence z is an ancestor of y in the directed graph, and thus 
(y, z) creates a directed cycle. 

This leaves the ONCLOSED procedure. Other than the EF lines, the structure 
is the same as in Algorithm 1, so the previous invariants are still preserved, 
and it remains to check the EF invariants. When we delete the successor edge 
and temporarily mark status(x) = Open for recursive calls, we also remove it 
from EF, preserving the inter-edge invariant. Similarly, when we add a successor 
edge to x, we add it to EF, preserving the inter-edge invariant. So it remains to 
consider when the set of canonical states changes, which is when merging states 
in a cycle. Here, a line of canonical states is merged into a single state, and a 
line of connected trees is still a tree, so the intra-edge invariant still holds for 
the new canonical state, and we are done. 


Theorem 3. Algorithm 2 uses amortized logarithmic time per edge update. 


Proof. By the analysis of Algorithm 1, each line of the algorithm is executed
O(m) times and there are O(m) calls to CHECKCYCLE. Each line of code is
either constant-time, α(m) = o(log m) time for the UF calls, or O(log m) time for
the EF calls, so the algorithm takes O(m log m) time in total, or amortized
O(log m) time per edge.


Algorithm 3. Lazy algorithm.

  All data from Algorithm 1; jumps: a map from X to lists of V

  procedure ONEDGE, ONTERMINAL, ONCLOSED as in Algorithm 1

  procedure CHECKCYCLE(y, z) returning bool
      return y = GETROOT(z)

  procedure GETROOT(z) returning V
      if status(z) = Open then return z
      if jumps(z) is empty then push succ(z) to jumps(z)       ▷ set 0th jump
      repeat pop w from jumps(z); z′ ← UF.find(w)              ▷ remove dead jumps
      until status(z′) ≠ Dead
      push z′ to jumps(z); result ← GETROOT(z′)
      n ← length(jumps(z)); n′ ← length(jumps(z′))
      if n ≤ n′ then push jumps(z′)[n − 1] to jumps(z)         ▷ set nth jump
      return result

  procedure MERGE(x, y) returning V
      z ← UF.union(x, y)
      bck(z) ← bck(x) + bck(y); res(z) ← res(x) + res(y)
      jumps(z) ← empty; return z


3.3 Lazy Algorithm 


While the asymptotic complexity of log m could be the end of the story, in 
practice, we found the cost of the EF calls to be a significant overhead. The 
technical details of Euler Tour Trees include building an AVL-tree cycle for each 
tree, where the cycle contains each state of the graph once and each edge in the 
graph twice. While this is elegant, it turns out that adding one edge to EF results 
in no less than seven modifications⁵ to the AVL tree: a split at the source, then 
a split at the target, then an edge addition in both directions (u,v) and (v, u) 
to the cycle, and finally the four resulting trees need to be glued together (using 
three merge operations). Each one of these operations comes with a rebalancing 
operation which could do Ω(log m) tree rotations and pointer dereferences to visit 
the nodes in the AVL tree. Some optimizations may be possible — including, 
e.g., combining rebalancing operations or considering variants of AVL trees with 
better cache locality. Nonetheless, these constant-factor overheads constitute a 
serious practical drawback for Algorithm 2. 
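
To make the shape of such a tour concrete, the following minimal sketch lists, for a rooted tree, each state once and each edge twice (once per direction). It is an illustration only: it stores the tour in a plain Python list rather than in a balanced tree, and the function name `euler_tour` and the dictionary encoding of the tree are assumptions made for this example, not part of our implementation.

```python
from typing import Dict, List, Tuple, Union

# A tour entry is either a state (str) or a directed edge (pair of states).
Entry = Union[str, Tuple[str, str]]


def euler_tour(children: Dict[str, List[str]], root: str) -> List[Entry]:
    """Return a tour that lists each state once and each tree edge twice."""
    tour: List[Entry] = []

    def visit(u: str) -> None:
        tour.append(u)                  # the state itself, exactly once
        for v in children.get(u, []):
            tour.append((u, v))         # edge in the forward direction
            visit(v)
            tour.append((v, u))         # the same edge in the reverse direction

    visit(root)
    return tour


# A small tree with 4 states and 3 edges: a -> b -> d and a -> c.
tree = {"a": ["b", "c"], "b": ["d"]}
tour = euler_tour(tree, "a")
```

For a tree with n states the tour has n + 2(n - 1) entries, which is why every edge insertion has to touch the sequence at several positions once the sequence is kept in a balanced search tree.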

To address this, in this section, we investigate a simpler, lazy algorithm which 
avoids EF and directly optimizes Algorithm 1. For this, one idea in the right 
direction is to store for each state a direct pointer to the root which results from 
repeatedly calling succ. But there are two issues with this. First, maintaining this 
may be difficult (when the root changes, potentially updating a linear number 
of root pointers). Second, the root may be marked dead, in which case we have 
to re-compute all pointers to that root. 


5 Our implementation actually uses nine modifications, as the splits at the source and 
target also disconnect the source and target states. 

Instead, we introduce a jump list from each state: intuitively, it will contain 
states after calling successor once, twice, four times, eight times, and so on at 
powers of two; and it will be updated lazily, at most once for every visit to 
the state. When a jump becomes obsolete (the target dead), we just pop off 
the largest jump, so we do not lose all of our work in building the list. We 
maintain the following additional information: for each unknown canonical state 
x, a nonempty list of jumps [v_0, v_1, v_2, ..., v_k], such that v_0 is reachable from x, 
v_1 is reachable from v_0, v_2 is reachable from v_1, and so on, and v_0 = succ(x). 

The resulting algorithm is shown in Algorithm 3. The key procedure is 
GETROOT(z), which is called when adding a reserve edge (y, z) to the graph. In 
addition to all invariants from Algorithm 1, we maintain the following invariants 
for every unknown canonical state x, where jumps(x) is a list of states 
v_0, v_1, v_2, ..., v_k. First jump: if the jump list is nonempty, then v_0 = succ(x). 
Reachability: v_{i+1} is reachable from v_i for all i. The jump list also satisfies the 
following powers of two invariant: on the path of canonical states from v_0 to v_i, 
the total number of states (including all states in each equivalence class) is at 
least 2^i. While this invariant is not necessary for correctness, it is the key to the 
algorithm’s practical efficiency: it follows from this that if the jump list is fully 
saturated for every state, querying GETROOT(z) will take only logarithmic time. 
However, since jump lists are updated lazily, the jump list may not be saturated, 
so this does not establish a true asymptotic complexity for the algorithm. 
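
The lookup can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our Rust implementation: states are never merged (so UF.find is the identity), the maps status, succ, and jumps are plain dictionaries, and the refill of an emptied jump list is placed inside the pop loop so that a list consisting only of dead jumps is handled. The name `get_root` and the string statuses are hypothetical.

```python
from typing import Dict, List


def get_root(z: str, status: Dict[str, str], succ: Dict[str, str],
             jumps: Dict[str, List[str]]) -> str:
    """Follow jump pointers from z to the open root of its successor tree."""
    if status[z] == "Open":
        return z
    jl = jumps.setdefault(z, [])
    while True:
        if not jl:
            jl.append(succ[z])            # (re)set the 0th jump
        target = jl.pop()
        if status[target] != "Dead":      # drop jumps whose target has died
            break
        if not jl and target == succ[z]:  # assumption violated: succ(z) is dead
            raise ValueError(f"successor of {z!r} is dead")
    jl.append(target)
    result = get_root(target, status, succ, jumps)
    n, n_target = len(jl), len(jumps.get(target, []))
    if n <= n_target:
        # Composing two jumps of rank n - 1 yields the jump of rank n.
        jl.append(jumps[target][n - 1])
    return result


# A line s4 -> s3 -> s2 -> s1 -> r, where r is the only open state.
status = {"r": "Open", "s1": "Unknown", "s2": "Unknown",
          "s3": "Unknown", "s4": "Unknown"}
succ = {"s1": "r", "s2": "s1", "s3": "s2", "s4": "s3"}
jumps: Dict[str, List[str]] = {}
root = get_root("s4", status, succ, jumps)
```

Each visit extends the jump list of a state by at most one entry, which is the lazy behavior described above: repeated queries from the same state gradually saturate its list.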


Theorem 4. Algorithm 3 is correct. 


Proof. The first jump and reachability invariants imply that v_1, v_2, ... is some 
sublist of the states along the path from an unknown state to its root, potentially 
followed by some dead states. We need to argue that the subprocedure GETROOT 
(i) returns the same verdict as repeatedly calling succ to find a cycle in the first-cut 
algorithm and (ii) preserves both invariants. For first jump, if the jump list is 
empty, then GETROOT ensures that the first jump is set to the successor state. 
For reachability, popping dead states from the jump list clearly preserves the 
invariant, as does adding on a state along the path to the root, which is done 
when k' ≥ k. Merging states preserves both invariants trivially because we throw 
the jump list away, and marking states live preserves both invariants trivially 
since the jump list is only maintained and used for unknown states. 


4 Experimental Evaluation 


The primary goal of our evaluation has been to experimentally validate the 
performance of GIDs as a data structure in isolation, rather than their use in a 
particular application. Our evaluation seeks to answer the following questions: 
Q1 How does our approach (Algorithms 2 and 3) compare to the state-of-the-art 
approach based on maintaining SCCs? 

Q2 How does the performance of the studied algorithms vary when the class of 
input graphs changes (e.g., sparse vs. dense, structured vs. random)? 

Q3 Finally, how do the studied algorithms perform on GIDs taken from the 
example application to regexes described in Sect. 5? 


To answer Q1, we put substantial implementation effort into a com- 
mon framework on which a fair comparison could be made between different 
approaches. To this end, we implemented GIDs as a data structure in Rust 
which includes a graph data structure on top of which all algorithms are built. In 
particular, this equalizes performance across algorithms for the following base- 
line operations: state and edge addition and retrieval, DFS and BFS search, 
edge iteration, and state merging. We chose Rust for our implementation for 
its performance, and because there does not appear to be an existing publicly 
available implementation of BFGT in any other language.⁶ The number of lines 
of code used to implement these various structures is summarized in Fig. 4. We 
implement Algorithms 2 and 3 and compare them with the following baselines: 


BFGT The state-of-the-art approach based on SCC maintenance, using worst-case 
amortized O(√m) time per update [10]. 

Simple A simpler version of BFGT that uses a forward-DFS to search for cycles. 
Like Algorithm 1, it can take O(m^2) in the worst case. 

Naïve A greedy upper bound for all approaches which re-computes the entire 
set of dead states using a linear-time DFS after each update. 
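
As an illustration of what the Naïve baseline recomputes after each update, the following sketch classifies all states of a fixed graph in linear time. It assumes the following reading of the definitions: a state is live if it reaches a terminal state, and dead if it is not live and cannot reach any open (not yet closed) state. The function name `classify` and the use of backward reachability (rather than whatever traversal order our implementation uses) are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def classify(states: Iterable[str], edges: List[Tuple[str, str]],
             closed: Set[str], terminal: Set[str]) -> Dict[str, str]:
    """Label every state as Live, Dead, Unknown, or Open."""
    states = list(states)
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)

    def backward_reach(sources: Iterable[str]) -> Set[str]:
        # DFS over reversed edges: all states that can reach some source.
        seen = set(sources)
        stack = list(seen)
        while stack:
            v = stack.pop()
            for u in preds[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        return seen

    open_states = {s for s in states if s not in closed and s not in terminal}
    live = backward_reach(terminal)
    may_grow = backward_reach(open_states)

    result = {}
    for s in states:
        if s in live:
            result[s] = "Live"
        elif s in open_states:
            result[s] = "Open"
        elif s in may_grow:
            result[s] = "Unknown"
        else:
            result[s] = "Dead"
    return result


labels = classify(
    states=["a", "b", "c", "d", "e", "t"],
    edges=[("a", "b"), ("b", "a"), ("c", "t"), ("d", "e")],
    closed={"a", "b", "c", "d"},
    terminal={"t"},
)
```

Here the closed cycle between a and b is dead, c is live because it reaches the terminal state t, and d stays unknown because it still reaches the open state e.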


To answer Q2, first, we compiled a range of basic graph classes which are 
designed to expose edge case behavior in the algorithms, as well as randomly 
generated graphs. We focus on graphs with no live states, as live states are 
treated similarly by all algorithms. Most of the generated graphs come in 2 × 2 = 
4 variants: (i) the states are either read in a forward or backward order; and 
(ii) they are either dead graphs, where there are no open states at the end and so 
everything gets marked dead; or unknown graphs, where there is a single open 
state at the end, so most states are unknown. In the unknown case, it is sufficient 
to have one open state at the end, as many open states can be reduced to the 
case of a single open state where all edges point to that one. We include GIDs 
from line graphs and cycle graphs (up to 100K states in multiples of 3); complete 
and complete acyclic graphs (up to 1K states); and bipartite graphs (up to 1K 
states). These are important cases, for example, because the reverse-order line 
and cycle graphs are a potential worst case for Simple and BFGT. 

Second, to exhibit more dynamic behavior, we generated random graphs: 
sparse graphs with a fixed out-degree from each state, chosen from 1,2,3, or 
10 (up to 100K states); and dense graphs with a fixed probability of each edge, 
chosen from .01, .02, or .03 (up to 10K states). Each case uses 10 different random 
seeds. As with the basic graphs, states are read in some order and marked closed. 
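
The two random families can be sketched as follows. This is a hedged illustration of the description above, not the generator used for the benchmarks: the function names are hypothetical, and whether self-loops and duplicate edges are permitted is an assumption (here they are).

```python
import random
from typing import List, Tuple

Edge = Tuple[int, int]


def sparse_graph(n: int, out_degree: int, seed: int) -> List[Edge]:
    """Every state gets exactly out_degree edges to uniformly random targets."""
    rng = random.Random(seed)
    return [(u, rng.randrange(n)) for u in range(n) for _ in range(out_degree)]


def dense_graph(n: int, p: float, seed: int) -> List[Edge]:
    """Every ordered pair (u, v) becomes an edge independently with probability p."""
    rng = random.Random(seed)
    return [(u, v) for u in range(n) for v in range(n) if rng.random() < p]


# One graph per seed, mirroring the use of 10 seeds per configuration.
sparse_cases = [sparse_graph(1000, 3, seed) for seed in range(10)]
dense_cases = [dense_graph(100, 0.02, seed) for seed in range(10)]
```

Passing the seed explicitly keeps each benchmark reproducible, so that all algorithms are run on identical update sequences.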


6 That is, BFGT for SCC maintenance. BFGT for cycle detection has been implemented 
before, for instance, in [28] and formally verified in [32]. 
Implementation Component   LoC  |  Category  Benchmark         Source  Qty 
Common Framework          1197  |  Basic     Line                       24 
Naïve Algorithm             78  |            Cycle                      24 
Simple Algorithm            98  |            Complete                   18 
BFGT Algorithm             265  |            Bipartite                  14 
Algorithm 2 (Ours)         253  |            Total                      80 
Algorithm 3 (Ours)         283  |  Random    Sparse                    260 
Euler Tour Trees          1510  |            Dense                     130 
Experimental Scripts       556  |            Total                     390 
Separated Unit Tests       800  |  Regex     RegExLib [15]     2061     37 
Utility                    217  |            Handwritten [61]    70     26 
Other                       69  |            Additional                 11 
Total                     5326  |            Total                      74 


Fig. 4. Left: Lines of code for each algorithm and other implementation components. 
Right: Benchmark GIDs used in our evaluation. Where present, the source column 
indicates the quantity prior to filtering out trivially small graphs. 


To answer Q3, we wrote a backend to extract a GID at runtime from 
Z3’s regex solver [61]. While the backend of the solver is precisely a GID — 
and so could be passed to our GID implementation dynamically — this setup 
includes many extraneous overheads, including rewriting expressions and com- 
puting derivatives when adding nodes to the graph. While some of these over- 
heads may be possible to eliminate, and we are fairly confident that GIDs would 
be a bottleneck for sufficiently large input examples, this makes it difficult to 
isolate the performance impact of the GID data structure itself, which is the 
sole focus of this paper. We therefore instrumented the Z3 solver code to export 
the (incremental) sequence of graph updates that would be performed during a 
run of Z3 on existing regex benchmarks. For each benchmark, this instrumented 
code produces a faithful representation of the sequence of graph updates that 
actually occur in a run of the SMT solver on this particular benchmark. For 
each regex benchmark, we thus get a GID benchmark for the present paper. 
The benchmarks focus on extended regexes, rather than plain classical regexes, 
as these are the ones for which dead state detection is relevant (see Sect. 5). We 
include GIDs for the RegExLib benchmarks [15] and the handcrafted Boolean 
benchmarks reported in [61]. We add to these 11 additional examples designed 
to be difficult GID cases. The collection of regex benchmarks we used (just 
described) is available on GitHub.⁷ 

From both the Q2 and Q3 benchmarks, we filter out any benchmark which 
takes under 10 milliseconds for all of the algorithms to solve (including Naïve), 
and we use a 60 second timeout. The evaluation was run on a 2020 MacBook 
Air (macOS Monterey) with an Apple M1 processor and 8 GB of memory. 


7 https://github.com/cdstanford/regex-smt-benchmarks. 
Fig. 5. Evaluation results. Left: Cumulative plot showing the number of benchmarks 
solved in time t or less for basic GID classes (top), randomly generated GIDs (middle), 
and regex-derived GIDs (bottom). Top right: Scatter plot showing the size of each 
benchmark vs time to solve. Bottom right: Average time to solve benchmarks of size 
closest to s, where values of s are chosen in increments of 1/3 on a log scale. 


Correctness. To ensure that all of our implementations are correct, we invested 
time into unit testing and checked output correctness on all of our collected 
benchmarks, including several cases which exposed bugs in previous versions 
of one or more algorithms. In total, all algorithms are vetted against 25 unit 
tests from handwritten edge cases that exposed prior bugs, 373 unit tests from 
benchmarks, and 30 module-level unit tests for specific functions. 


Results. Figure 5 shows the results. Algorithm 3 shows significant improvements 
over the state-of-the-art, solving more benchmarks in a smaller amount 
of time across basic GIDs, random GIDs, and regex GIDs. Algorithm 2 also 
shows state-of-the-art performance, similar to BFGT on basic and regex GIDs 
and significantly better on random GIDs. On the bottom right, since looking at 
average time is not meaningful for benchmarks of widely varying size, we stratify 
the size of benchmarks into buckets, and plot time-to-solve as a function of 
size. Both x-axis and y-axis are on a log scale. The plot shows that Algorithm 
3 exhibits up to two orders of magnitude speedup over BFGT for larger GIDs: 
we see speedups of 110x to 530x for GIDs in the top five size buckets (GIDs of 
size nearest to ~100K, ~200K, ~500K, ~1M, and ~2M). 
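
The stratification into size buckets can be sketched as follows. The sketch assumes that "nearest" means nearest on the log scale and that the per-bucket statistic is the arithmetic mean; the names `bucket_index` and `average_time_by_bucket` are hypothetical and do not refer to our evaluation scripts.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bucket_index(size: int) -> int:
    """Index k of the bucket s = 10**(k/3) nearest to size on a log scale."""
    return round(3 * math.log10(size))


def average_time_by_bucket(
        results: Iterable[Tuple[int, float]]) -> Dict[int, float]:
    """Map each bucket index to the mean solving time of its benchmarks."""
    groups = defaultdict(list)
    for size, seconds in results:
        groups[bucket_index(size)].append(seconds)
    return {k: sum(ts) / len(ts) for k, ts in sorted(groups.items())}


# Bucket centers 10**(k/3) for k = 15..19 are roughly 100K, 215K, 464K, 1M, 2.15M.
centers = [10 ** (k / 3) for k in range(15, 20)]
```

With increments of 1/3 in the exponent, three consecutive buckets span one order of magnitude, which matches the labels ~100K, ~200K, ~500K, ~1M, and ~2M used above.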
New Implementations of Existing Work. Our implementation contributes, 
to our knowledge, the first implementation of BFGT specifically for SCC main- 
tenance. In addition, it is one of the first implementations of Euler Tour Trees 
(see [7] for another), including the AVL tree backing for tours, and likely the 
first implementation in Rust. 


5 Application to Extended Regular Expressions 


In this section, we explain how precisely the GID state classification problem 
arises in the context of derivative-based solvers [45,61]. We first define extended 
regexes [31] (regexes extended with intersection & and complement ~) modulo a 
symbolic alphabet A of predicates that represent sets of characters. We explain 
the main idea behind symbolic derivatives, as found in [61]; these generalize Brzo- 
zowski [18] and Antimirov derivatives [5] (see also [19,42] for other proposals). 
Symbolic derivatives provide the foundation for incrementally creating a GID. 
Then we show, through an example, how a solver can incrementally expand 
derivatives to reduce the satisfiability problem to the GID state classification 
problem (Definition 2). 

Define a regex by the following grammar, where φ ∈ A denotes a predicate: 


RE ::= φ | ε | RE_1 · RE_2 | RE* | RE_1 | RE_2 | RE_1 & RE_2 | ~RE 


Let R^k represent the concatenation of R k times. The symbolic derivative of a 
regex R, denoted δ(R), is a regex which describes the set of suffixes of strings in 
R after the first character is removed. The formal definition can be found in [61] 
and in the arXiv version of the present paper [60]. 

To apply Definition 1 to regexes: states are regexes; edges are transitions 
from a regex to its derivatives; and terminal states are the so-called nullable 
regexes, where a regex is nullable if it matches the empty string. Nullability can 
be computed inductively over the structure of regexes: for example, ε and R* 
are nullable, and R_1 & R_2 is nullable iff both R_1 and R_2 are nullable. A live 
state here is thus a regex that reaches a nullable regex via 0 or more edges. 
This implies that there exists a concrete string matching it. Conversely, dead 
states are always empty, i.e. they match no strings, but can reach other dead 
states, creating strongly connected components of closed states none of which 
are live. For example, the false predicate ⊥ of A serves as the regex that matches 
nothing and is trivially a dead state. Thus ~⊥ is equivalent to .*, where . is the 
true predicate and is trivially a live state. 


5.1 Reduction from Incremental Regex Emptiness to GIDs 


For simplicity, suppose we want to determine the satisfiability of a single regex 
constraint s ∈ R, where s is a string variable and R is a concrete regex. (This is 
not overly restrictive: any number of simultaneous regex constraints for a string 
s can be combined into a single regex constraint by using the Boolean operations 
of regexes.) For example, let L = ~(.*α.^{100}) and R = L & (.α), where α is the 
“is digit” predicate that is true of characters that are digits (often denoted \d). 
The solver manipulates regex membership constraints on strings by unfolding 
them [61]. The constraint s ∈ R, which essentially tests nonemptiness of R with 
s as a witness, becomes 


(s = ε ∧ Nullable(R)) ∨ (s ≠ ε ∧ s_{1..} ∈ δ_{s_0}(R)) 

where s ≠ ε since R is not nullable, s_{i..} is the suffix of s from index i, and 

δ(R) = δ(L) & δ(.α) = (α ? L & ~(.^{100}) : L) & α = (α ? L & ~(.^{100}) & α : L & α) 


Let R_1 = L & ~(.^{100}) & α and R_2 = L & α. So R has two outgoing transitions 
R --α--> R_1 and R --¬α--> R_2 that contribute the edges (R, R_1) and (R, R_2) into the 
GID. Note that these edges depend only on R and not on s_0. 

We continue the search incrementally by checking the two branches of the 
if-then-else constraint, where R_1 and R_2 are again not nullable (so s_{1..} ≠ ε): 


s_0 ∈ α ∧ s_{2..} ∈ δ_{s_1}(R_1) ∨ s_0 ∈ ¬α ∧ s_{2..} ∈ δ_{s_1}(R_2) 
δ(R_1) = (α ? L & ~(.^{100}) & ~(.^{99}) : L & ~(.^{99})) & (α ? ε : ⊥) = (α ? ε : ⊥) 
δ(R_2) = (α ? L & ~(.^{100}) : L) & (α ? ε : ⊥) = (α ? ε : ⊥) 


It follows that R_1 --α--> ε and R_2 --α--> ε, so the edges (R_1, ε) and (R_2, ε) are added to 
the GID where ε is a trivial terminal state. In fact, after R_1 the search already 
terminates because we then have the path (R, R_1)(R_1, ε) that implies that R is 
live. The associated constraints s_0 ∈ α and s_1 ∈ α and the final constraint that 
s_{2..} = ε can be used to extract a concrete witness, e.g., s = "42". 
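
The witness can be checked mechanically. The following sketch is an assumption-laden illustration rather than the solver's method: it uses classical per-character derivatives in the style of Brzozowski instead of the symbolic derivatives of [61], performs no simplification, and encodes regexes as nested tuples with hypothetical names (`nullable`, `deriv`, `matches`, `power`).

```python
EPS, BOT = ("eps",), ("bot",)          # empty string, and the empty language
ANY = ("pred", lambda c: True)         # the true predicate "."
DIGIT = ("pred", str.isdigit)          # the "is digit" predicate


def nullable(r) -> bool:
    """Does r match the empty string?"""
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("bot", "pred"):
        return False
    if tag == "not":
        return not nullable(r[1])
    if tag == "or":
        return nullable(r[1]) or nullable(r[2])
    return nullable(r[1]) and nullable(r[2])     # "cat" and "and"


def deriv(r, c: str):
    """Regex for the suffixes of strings in r that start with character c."""
    tag = r[0]
    if tag in ("eps", "bot"):
        return BOT
    if tag == "pred":
        return EPS if r[1](c) else BOT
    if tag == "star":
        return ("cat", deriv(r[1], c), r)
    if tag == "not":
        return ("not", deriv(r[1], c))
    if tag in ("or", "and"):
        return (tag, deriv(r[1], c), deriv(r[2], c))
    left = ("cat", deriv(r[1], c), r[2])         # tag == "cat"
    return ("or", left, deriv(r[2], c)) if nullable(r[1]) else left


def matches(r, s: str) -> bool:
    for c in s:
        r = deriv(r, c)
    return nullable(r)


def power(r, k: int):
    out = EPS
    for _ in range(k):
        out = ("cat", r, out)
    return out


# L = ~(.*α.^{100}) and R = L & (.α) from the running example.
L = ("not", ("cat", ("star", ANY), ("cat", DIGIT, power(ANY, 100))))
R = ("and", L, ("cat", ANY, DIGIT))
found = matches(R, "42")
```

Taking two derivatives of R and testing nullability mirrors the path (R, R_1)(R_1, ε) above, except that each step here fixes one concrete character instead of branching on the predicate α.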

Soundness of the algorithm follows from the fact that if R is nonempty (s ∈ R is 
satisfiable), then we eventually arrive at a nullable (terminal) regex, as in the 
example run above. To achieve completeness, and to eliminate dead states as 
early as possible, we incrementally construct a GID corresponding to the set 
of regexes seen so far (as above). After all the feasible transitions from R to 
its derivatives in δ(R) are added to the GID as edges (WLOG in one batch), 
the state R becomes closed. Crucially, due to the symbolic form of δ(R), no 
derivative is missing. Therefore R is known to be empty precisely as soon as R is 
detected as a dead state in the GID. An additional benefit is that the algorithm 
is independent of the size of the universe of A, which may be very large (e.g. the 
Unicode character set), or even infinite. We get the following theorem that uses 
finiteness of the closure of symbolic derivatives [61, Theorem 7.1]: 


Theorem 5. For any regex R, (1) If R is nonempty, then the decision procedure 
eventually marks R live. (2) If R is empty, then the decision procedure marks R 
dead at the earliest stage that it is known to be dead, and terminates. 


6 Related Work 


Online Graph Algorithms. Online graph algorithms are typically divided into 
problems over incremental graphs (where edges are added), decremental graphs 
(where edges are deleted), and dynamic graphs (where edges are both added 
and deleted), with core data structures discussed in [27,49]. Important prob- 
lems include transitive closure, cycle detection, topological ordering, and strongly 
connected component (SCC) maintenance. 

For incremental topological ordering, [46] is an early work, and [33] presents 
two different algorithms, one for sparse graphs and one for dense graphs — the 
algorithms are also extended to work with SCCs. The sparse algorithm was sub- 
sequently simplified in [10] and is the basis of our implementation named BFGT 
in Sect. 4. A unified approach of several algorithms based on [10] is presented 
in [21] that uses a notion of weak topological order and a labeling technique that 
estimates transitive closure size. Further extensions of [10] are studied in [11,14] 
based on randomization. 

For dynamic directed graphs, a topological sorting algorithm that is experi- 
mentally preferable for sparse graphs is discussed in [56], and a related article [55] 
discusses strongly connected components maintenance. Transitive closure for 
dynamic graphs is studied in [57], improving upon some algorithms presented 
earlier in [34]. One major application for these algorithms is in pointer analy- 
sis [54]. 

For undirected forests, fully dynamic reachability is solvable in amortized 
logarithmic time per edge via multiple possible approaches [3,30,35,59,64]; our 
implementation uses Euler Tour Trees [35]. 


Data Structures for SMT. UnionFind [63] is a foundational data structure 
used in SMT. E-graphs [23,67] are used to ensure functional extensionality, where 
two expressions are equivalent if their subexpressions are equivalent [25,52]. In 
both UnionFind and E-graphs, the maintained relation is an equivalence rela- 
tion. In contrast, maintaining live and dead states involves tracking reachability 
rather than equivalence. To the best of our knowledge, the specific formulation 
of incremental reachability we consider here is new. 


Dead State Elimination in Automata. A DFA or NFA may be viewed as a 
GID, so state classification in GIDs solves dead state elimination in DFAs and 
NFAs, while additionally working in an incremental fashion. Dead state elimi- 
nation is also known as trimming [37] and plays an important role in automata 
minimization [12,38,48]. The literature on minimization is vast, and goes back 
to the 1950s [16,17,39-41,50,53]; see [65] for a taxonomy, [2] for an experimental 
comparison, and [22] for the symbolic case. Watson et al. [66] propose an 
incremental minimization algorithm, in the sense that it can be halted at any 
point to produce a partially minimized, equivalent DFA; unlike in our setting, 
the DFA’s states and transitions are fixed and read in a predetermined order. 
Abstract. Many parallel programming models guarantee that if all 
sequentially consistent (SC) executions of a program are free of data 
races, then all executions of the program will appear to be sequen- 
tially consistent. This greatly simplifies reasoning about the program, 
but leaves open the question of how to verify that all SC executions 
are race-free. In this paper, we show that with a few simple modifica- 
tions, model checking can be an effective tool for verifying race-freedom. 
We explore this technique on a suite of C programs parallelized with 
OpenMP. 


Keywords: data race · model checking · OpenMP 


1 Introduction 


Every multithreaded programming language requires a memory model to specify 
the values a thread may obtain when reading a variable. The simplest such 
model is sequential consistency [22]. In this model, an execution is an interleaved 
sequence of the execution steps from each thread. The value read at any point 
is the last value that was written to the variable in this sequence. 
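
This definition can be made concrete by enumerating all interleavings of a small two-thread program. The sketch below uses the classic store-buffering litmus test; the encoding of operations as tuples and the names `interleavings` and `run` are assumptions made for this illustration.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Op = Tuple[str, str, object]   # ("write", variable, value) or ("read", variable, register)


def interleavings(a: List[Op], b: List[Op]) -> Iterator[List[Op]]:
    """Yield every merge of a and b that preserves each thread's own order."""
    n = len(a) + len(b)
    for positions in combinations(range(n), len(a)):
        chosen = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in chosen else next(ib) for i in range(n)]


def run(schedule: List[Op]) -> Tuple[int, int]:
    """Execute one interleaving; each read returns the last value written."""
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, var, arg in schedule:
        if op == "write":
            mem[var] = arg
        else:
            regs[arg] = mem[var]
    return regs["r1"], regs["r2"]


thread1 = [("write", "x", 1), ("read", "y", "r1")]
thread2 = [("write", "y", 1), ("read", "x", "r2")]
outcomes = {run(s) for s in interleavings(thread1, thread2)}
```

Under sequential consistency the outcome r1 = r2 = 0 is impossible for this program, because whichever write comes first in the interleaving is visible to the other thread's later read.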

There is no known efficient way to implement a full sequentially consistent 
model. One reason for this is that many standard compiler optimizations are 
invalid under this model. Because of this, most multithreaded programming lan- 
guages (including language extensions) impose a requirement that programs do 
not have data races. A data race occurs when two threads access the same vari- 
able without appropriate synchronization, and at least one access is a write. 
(The notion of appropriate synchronization depends on the specific language.) 
For data race-free programs, most standard compiler optimizations remain valid. 
The Pthreads library is a typical example, in that programs with data races 
have no defined behavior, but race-free programs are guaranteed to behave in a 
sequentially consistent manner [25]. 

Modern languages use more complex “relaxed” memory models. In these models, 
an execution is not a single sequence, but a set of events together with various 
relations on those events. These relations (e.g., sequenced before, modification 
order, synchronizes with, dependency-ordered before, happens before [21]) must 
satisfy a set of complex constraints spelled out in the language specification. The 
complexity of these models is such that only the most sophisticated users can 
be expected to understand and apply them correctly. Fortunately, these models 
usually provide an escape, in the form of a substantial and useful language subset 
which is guaranteed to behave sequentially consistently, as long as the program 
is race-free. Examples include Java [23], C and C++ since their 2011 versions 
(see [8] and [21, §5.1.2.4 Note 19]), and OpenMP [26, §1.4.6]. 

The “guarantee” mentioned above actually consists of two parts: (1) all exe- 
cutions of data race-free programs in the language subset are sequentially con- 
sistent, and (2) if a program in the language subset has a data race, then it has 
a sequentially consistent execution with a data race [8]. Putting these together, 
we have, for any program P in the language subset: 


(SC4DRF) If all sequentially consistent executions of P are data 
race-free, then all executions of P are sequentially consistent. 


The consequence of this is that the programmer need only understand sequen- 
tially consistent semantics, both when trying to ensure P is race-free, and when 
reasoning about other aspects of the correctness of P. This approach provides 
an effective compromise between usability and efficient implementation. 

Still, it is the programmer’s responsibility to ensure that all sequentially 
consistent executions of the program are race-free. Unfortunately, this problem 
is undecidable [4], so no completely algorithmic solution exists. As a practical 
matter, detecting and eliminating races is considered one of the most challeng- 
ing aspects of parallel program development. One source of difficulty is that 
compilers may “miscompile” racy programs, i.e., translate them in unintuitive, 
non-semantics-preserving ways [7]. After all, if the source program has a race, 
the language standard imposes no constraints, so any output from the compiler 
is technically correct. 

Researchers have explored various techniques for race checking. Dynamic 
analysis tools (e.g., [18]) have experienced the most uptake. These techniques 
can analyze a single execution precisely, and report whether a race occurred, 
and sometimes can draw conclusions about closely related executions. But the 
behavior of many concurrent programs depends on the program input, or on 
specific thread interleavings, and dynamic techniques cannot explore all possible 
behaviors. Moreover, dynamic techniques necessarily analyze the behavior of 
the executable code that results from compilation. As explained above, racy 
programs may be miscompiled, even possibly removing the race, in which case 
a dynamic analysis is of limited use. 

Approaches based on static analysis, in contrast, have the potential to verify 
race-freedom. This is extremely challenging, though some promising research 
prototypes have been developed (e.g., [10]). The most significant limitation is 
imprecision: a tool may report that race-free code has a possible race (a “false 
alarm”). Some static approaches are also not sound, i.e., they may fail to detect 
a race in a racy program; like dynamic tools, these approaches are used more as 
bug hunters than verifiers. 

Finite-state model checking [15] offers an interesting compromise. This approach 
requires a finite-state model of the program, which is usually achieved 
by placing small bounds on the number of threads, the size of inputs, or other 
program parameters. The reachable states of the model can be explored through 
explicit enumeration or other means. This can be used to implement a sound and 
precise race analysis of the model. If a race is found, detailed information can 
be produced, such as a program trace highlighting the two conflicting memory 
accesses. Of course, if the analysis concludes the model is race-free, it is still pos- 
sible that a race exists for larger parameter values. In this case, one can increase 
those values and re-run the analysis until time or computational resources are 
exhausted. If one accepts the “small scope hypothesis” (the claim that most 
defects manifest in small configurations of a system), then model checking can 
at least provide strong evidence for the absence of data races. In any case, the 
results provide specific information on the scope that is guaranteed to be race- 
free, which can be used to guide testing or further analysis. 

The main limitation of model checking is state explosion, and one of the 
most effective techniques for limiting state explosion is partial order reduction 
(POR) [17]. A typical POR technique is based on the following observation: 
if, at a state s, a thread t is at a “local” statement (i.e., one which 
commutes with all statements from other threads), then it is often not necessary 
to explore all enabled transitions from s; instead, the search can explore only 
the enabled transitions from t. Usually local statements are those that access 
only thread-local variables. But if the program is known to be race-free, shared 
variable accesses can also be considered “local” for POR. This is the essential 
observation at the heart of recent work on POR in the verification of Pthreads 
programs [29]. 

In this paper, we explore a new model checking technique that can be used 
to verify race-freedom, as well as other correctness properties, for programs in 
which threads synchronize through locks and barriers. The approach requires 
two simple modifications to the standard state reachability algorithm. First, 
each thread maintains a history of the memory locations accessed since its last 
synchronization operation. These sets are examined for races and emptied at 
specific synchronization points. Second, a novel POR is used in which only lock 
(release and acquire) operations are considered non-local. In Sect. 2, we present 
a precise mathematical formulation of the technique and a theorem that it has 
the claimed properties, including that it is sound and precise for verification of 
race-freedom of finite-state models. 
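
The first modification can be illustrated with a small sketch. It is only an approximation of the idea under stated assumptions: each thread keeps a read set and a write set, a race is reported when one thread's write set meets another thread's read or write set, and the sets are checked and emptied by the caller at a synchronization point such as a barrier. The class name `AccessHistory` and its methods are hypothetical; the precise rules for when sets are examined and emptied are those of Sect. 2, not of this example.

```python
from typing import Dict, Iterable, List, Set, Tuple


class AccessHistory:
    """Per-thread sets of memory locations accessed since the last sync."""

    def __init__(self, thread_ids: Iterable[int]) -> None:
        self.reads: Dict[int, Set[str]] = {t: set() for t in thread_ids}
        self.writes: Dict[int, Set[str]] = {t: set() for t in self.reads}

    def record(self, tid: int, location: str, is_write: bool) -> None:
        (self.writes if is_write else self.reads)[tid].add(location)

    def races(self) -> List[Tuple[int, int, str]]:
        """Pairs of threads with conflicting accesses (at least one write)."""
        found = []
        tids = sorted(self.reads)
        for i, s in enumerate(tids):
            for t in tids[i + 1:]:
                conflict = (self.writes[s] & (self.reads[t] | self.writes[t])) \
                    | (self.writes[t] & self.reads[s])
                found.extend((s, t, loc) for loc in sorted(conflict))
        return found

    def clear(self) -> None:
        """Empty all sets, e.g. once every thread has passed a barrier."""
        for tid in self.reads:
            self.reads[tid].clear()
            self.writes[tid].clear()


history = AccessHistory([1, 2])
history.record(1, "a[0]", is_write=True)
history.record(2, "a[0]", is_write=False)
before_barrier = history.races()      # thread 1 writes what thread 2 reads
history.clear()
history.record(1, "a[1]", is_write=True)
history.record(2, "a[0]", is_write=False)
after_barrier = history.races()       # disjoint locations, no conflict
```

Because the sets are emptied at synchronization points, they stay small, and the check only compares accesses that are not ordered by an intervening synchronization.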

Using the CIVL symbolic execution and model checking platform [31], we 
have implemented a prototype tool, based on the new technique, for verify- 
ing race-freedom in C/OpenMP programs. OpenMP is an increasingly popular 
directive-based language for writing multithreaded programs in C, C++, or Fortran. 
A large sub-language of OpenMP has the SC4DRF guarantee.¹ While 
the theoretical model deals with locks and barriers, it can be applied to many 
OpenMP constructs that can be modeled using those primitives, such as atomic 
operations and critical sections. This is explained in Sect. 3, along with the results 
of some experiments applying our tool to a suite of C/OpenMP programs. In 
Sect. 4, we discuss related work and Sect. 5 concludes. 


2 Theory 


We begin with a simple mathematical model of a multithreaded program that 
uses locks and barriers for synchronization. 


Definition 1. Let TID be a finite set of positive integers. A multithreaded program 
with thread ID set TID comprises 


1. a set Lock of locks 
2. a set Shared of shared states 
3. for each i ∈ TID: 

(a) a set Local_i, the local states of thread i, which is the union of five disjoint 
subsets, Acquire_i, Release_i, Barrier_i, Nsync_i, and Term_i 

(b) a set Stmt_i of statements, which includes the lock statements acquire_i(l) 
and release_i(l) (for l ∈ Lock), and the barrier-exit statement exit_i; all 
other statements are known as nsync (non-synchronization) statements 

(c) for each σ ∈ Acquire_i ∪ Release_i ∪ Barrier_i, a local state next(σ) ∈ Local_i 

(d) for each σ ∈ Acquire_i ∪ Release_i, a lock lock(σ) ∈ Lock 

(e) for each σ ∈ Nsync_i, a nonempty set stmts(σ) ⊆ Stmt_i of nsync statements 
and function 


update(σ): stmts(σ) × Shared → Local_i × Shared. 


All of the sets Local_i and Stmt_i (i ∈ TID) are pairwise disjoint. 


Each thread has a unique thread ID number, an element of \(\mathsf{TID}\). A
local state for thread \(i\) encodes the values of all thread-local variables,
including the program counter. A shared state encodes the values of all shared
variables. (Locks are not considered shared variables.) A thread at an acquire
state \(\sigma\) is attempting to acquire the lock \(\mathsf{lock}(\sigma)\).
At a release state, the thread is about to release a lock. At a barrier state,
a thread is waiting inside a barrier. After executing one of the three
operations, each thread moves to a unique next local state. A thread that
reaches a terminal state has terminated. From an nsync state, any positive
number of statements are enabled, and each of these statements may read and
update the local state of the thread and/or the shared state.


¹ Any OpenMP program that does not use non-sequentially consistent atomic
directives, omp_test_lock, or omp_test_nest_lock [26, §1.4.6].

For \(i \in \mathsf{TID}\), the local graph of thread \(i\) is the directed
graph with nodes \(\mathsf{Local}_i\) and an edge \(\sigma \to \sigma'\) if
either (i) \(\sigma \in \mathsf{Acquire}_i \cup \mathsf{Release}_i \cup
\mathsf{Barrier}_i\) and \(\sigma' = \mathsf{next}(\sigma)\), or (ii)
\(\sigma \in \mathsf{Nsync}_i\) and there is some
\(\zeta' \in \mathsf{Shared}\) such that \((\sigma', \zeta')\) is in the image
of \(\mathsf{update}(\sigma)\).

Fix a multithreaded program \(P\) and let

\[
\mathsf{LockState} = (\mathsf{Lock} \to \{0\} \cup \mathsf{TID})
\]

\[
\mathsf{State} = \Bigl(\prod_{i \in \mathsf{TID}} \mathsf{Local}_i\Bigr)
  \times \mathsf{Shared} \times \mathsf{LockState} \times 2^{\mathsf{TID}}.
\]

A lock state specifies the owner of each lock. The owner is a thread ID, or 0
if the lock is free. The elements of \(\mathsf{State}\) are the (global)
states of \(P\). A state specifies a local state for each thread, a shared
state, a lock state, and the set of threads that are currently blocked at a
barrier.

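Let \(i \in \mathsf{TID}\) and \(L_i = \mathsf{Local}_i \times \mathsf{Shared}
\times \mathsf{LockState} \times 2^{\mathsf{TID}}\). Define

\[
\mathsf{enabled}_i \colon L_i \to 2^{\mathsf{Stmt}_i}, \qquad
\lambda \mapsto
\begin{cases}
\{\mathsf{acquire}_i(l)\} & \text{if } \sigma \in \mathsf{Acquire}_i \wedge
    l = \mathsf{lock}(\sigma) \wedge \theta(l) = 0 \\
\{\mathsf{release}_i(l)\} & \text{if } \sigma \in \mathsf{Release}_i \wedge
    l = \mathsf{lock}(\sigma) \wedge \theta(l) = i \\
\{\mathsf{exit}_i\} & \text{if } \sigma \in \mathsf{Barrier}_i \wedge
    i \notin w \\
\mathsf{stmts}(\sigma) & \text{if } \sigma \in \mathsf{Nsync}_i \\
\emptyset & \text{otherwise,}
\end{cases}
\]

where \(\lambda = (\sigma, \zeta, \theta, w) \in L_i\). This function returns
the set of statements that are enabled in thread \(i\) at a given state. This
function does not depend on the local states of threads other than \(i\),
which is why those are excluded from \(L_i\). An acquire statement is enabled
if the lock is free; a release is enabled if the calling thread owns the lock.
A barrier exit is enabled if the thread is not currently in the barrier
blocked set.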

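Execution of an enabled statement in thread \(i\) updates the state as follows:

\[
\mathsf{execute}_i \colon \{(\lambda, t) \in L_i \times \mathsf{Stmt}_i \mid
  t \in \mathsf{enabled}_i(\lambda)\} \to L_i
\]

\[
(\lambda, t) \mapsto
\begin{cases}
(\sigma', \zeta, \theta[l \mapsto i], w') & \text{if } \sigma \in
    \mathsf{Acquire}_i \wedge t = \mathsf{acquire}_i(l) \wedge
    \sigma' = \mathsf{next}(\sigma) \\
(\sigma', \zeta, \theta[l \mapsto 0], w') & \text{if } \sigma \in
    \mathsf{Release}_i \wedge t = \mathsf{release}_i(l) \wedge
    \sigma' = \mathsf{next}(\sigma) \\
(\sigma', \zeta, \theta, w') & \text{if } \sigma \in \mathsf{Barrier}_i \wedge
    t = \mathsf{exit}_i \wedge \sigma' = \mathsf{next}(\sigma) \\
(\sigma', \zeta', \theta, w') & \text{if } \sigma \in \mathsf{Nsync}_i \wedge
    t \in \mathsf{stmts}(\sigma) \wedge
    \mathsf{update}(\sigma)(t, \zeta) = (\sigma', \zeta')
\end{cases}
\]

where \(\lambda = (\sigma, \zeta, \theta, w)\) and in each case above

\[
w' =
\begin{cases}
w \cup \{i\} & \text{if } \sigma' \in \mathsf{Barrier}_i \wedge
    w \cup \{i\} \neq \mathsf{TID} \\
\emptyset & \text{if } \sigma' \in \mathsf{Barrier}_i \wedge
    w \cup \{i\} = \mathsf{TID} \\
w & \text{otherwise.}
\end{cases}
\]

Note a thread arriving at a barrier will have its ID added to the barrier
blocked set, unless it is the last thread to arrive, in which case all threads
are released from the barrier.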

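At a given state, the set of enabled statements is the union over all threads
of the enabled statements in that thread. Execution of a statement updates the
state as above, leaving the local states of other threads untouched:

\[
\mathsf{enabled} \colon \mathsf{State} \to 2^{\mathsf{Stmt}}, \qquad
s \mapsto \bigcup_{i \in \mathsf{TID}}
  \mathsf{enabled}_i(\xi_i, \zeta, \theta, w)
\]

\[
\mathsf{execute} \colon \{(s, t) \in \mathsf{State} \times \mathsf{Stmt} \mid
  t \in \mathsf{enabled}(s)\} \to \mathsf{State}, \qquad
(s, t) \mapsto (\xi[i \mapsto \sigma], \zeta', \theta', w'),
\]

where \(s = (\xi, \zeta, \theta, w) \in \mathsf{State}\),
\(t \in \mathsf{enabled}(s)\), \(i = \mathsf{tid}(t)\), and
\(\mathsf{execute}_i(\xi_i, \zeta, \theta, w, t) =
(\sigma, \zeta', \theta', w')\).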
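The definitions of enabled and execute can be exercised on a small example.
The following is a minimal Python sketch, not the authors' implementation. The
encoding is an assumption made for illustration: each local state maps to a
tagged tuple, statements are tuples such as ("acquire", i, l), lock and shared
states are dictionaries, and the names PROG, TIDS and FREE are hypothetical.

```python
FREE = 0  # lock owner value meaning "free", as in the lock-state definition


def enabled_i(prog, i, lam):
    """Statements of thread i enabled at lam = (sigma, zeta, theta, w)."""
    sigma, zeta, theta, w = lam
    node = prog[i][sigma]
    kind = node[0]
    if kind == "acquire":                     # ("acquire", lock, next)
        return {("acquire", i, node[1])} if theta[node[1]] == FREE else set()
    if kind == "release":                     # ("release", lock, next)
        return {("release", i, node[1])} if theta[node[1]] == i else set()
    if kind == "barrier":                     # ("barrier", next)
        return {("exit", i)} if i not in w else set()
    if kind == "nsync":                       # ("nsync", {name: update_fn})
        return {("nsync", i, name) for name in node[1]}
    return set()                              # terminal state


def execute_i(prog, tids, i, lam, t):
    """Apply enabled statement t of thread i; returns the new 4-tuple."""
    assert t in enabled_i(prog, i, lam)
    sigma, zeta, theta, w = lam
    node = prog[i][sigma]
    kind = node[0]
    if kind == "acquire":
        sigma2, zeta2, theta2 = node[2], zeta, {**theta, node[1]: i}
    elif kind == "release":
        sigma2, zeta2, theta2 = node[2], zeta, {**theta, node[1]: FREE}
    elif kind == "barrier":
        sigma2, zeta2, theta2 = node[1], zeta, theta
    else:                                     # nsync: update(sigma)(t, zeta)
        sigma2, zeta2 = node[1][t[2]](zeta)
        theta2 = theta
    if prog[i][sigma2][0] == "barrier":       # arrival at a barrier state
        w2 = w | {i}
        if w2 == tids:                        # last arrival releases everyone
            w2 = frozenset()
    else:
        w2 = w
    return (sigma2, zeta2, theta2, w2)


def enabled(prog, s):
    xi, zeta, theta, w = s
    return set().union(*(enabled_i(prog, i, (xi[i], zeta, theta, w)) for i in xi))


def execute(prog, tids, s, t):
    xi, zeta, theta, w = s
    i = t[1]                                  # tid(t)
    sigma2, zeta2, theta2, w2 = execute_i(prog, tids, i, (xi[i], zeta, theta, w), t)
    return ({**xi, i: sigma2}, zeta2, theta2, w2)


def thread_program():
    # acquire(l); x = x + 1; release(l); barrier; terminate
    return {
        "a": ("acquire", "l", "w"),
        "w": ("nsync", {"inc": lambda z: ("r", {**z, "x": z["x"] + 1})}),
        "r": ("release", "l", "b"),
        "b": ("barrier", "end"),
        "end": ("term",),
    }


TIDS = frozenset({1, 2})
PROG = {1: thread_program(), 2: thread_program()}
```

In this sketch the first thread to reach the barrier state is added to w and
has no enabled statement; when the second thread arrives, w becomes empty and
both exit statements are enabled.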


Definition 2. A transition is a triple \(s \xrightarrow{t} s'\), where
\(s \in \mathsf{State}\), \(t \in \mathsf{enabled}(s)\), and
\(s' = \mathsf{execute}(s, t)\). An execution \(\alpha\) of \(P\) is a (finite
or infinite) chain of transitions
\(s_0 \xrightarrow{t_1} s_1 \xrightarrow{t_2} \cdots\). The length of
\(\alpha\), denoted \(|\alpha|\), is the number of transitions in \(\alpha\).


Note that an execution is completely determined by its initial state \(s_0\)
and its statement sequence \(t_1 t_2 \cdots\).

Having specified the semantics of the computational model, we now turn to 
the concept of the data race. The traditional definition requires the notion of 
“conflicting” accesses: two accesses to the same memory location conflict when 
at least one is a write. The following abstracts this notion: 


Definition 3. A symmetric binary relation \(\mathsf{conflict}\) on
\(\mathsf{Stmt}\) is a conflict relation for \(P\) if the following hold for
all \(t_1, t_2 \in \mathsf{Stmt}\):


1. if \((t_1, t_2) \in \mathsf{conflict}\) then \(t_1\) and \(t_2\) are nsync
   statements from different threads
2. if \(t_1\) and \(t_2\) are nsync statements from different threads and
   \((t_1, t_2) \notin \mathsf{conflict}\), then for all
   \(s \in \mathsf{State}\), if \(t_1, t_2 \in \mathsf{enabled}(s)\) then

\[
\mathsf{execute}(\mathsf{execute}(s, t_1), t_2) =
\mathsf{execute}(\mathsf{execute}(s, t_2), t_1).
\]


Fix a conflict relation for P for the remainder of this section. 

The next ingredient in the definition of data race is the happens-before
relation. This is a relation on the set of events generated by an execution.
An event is an element of
\(\mathsf{Event} = \mathsf{Stmt} \times \mathbb{N}\).


Definition 4. Let \(\alpha = (s_0 \xrightarrow{t_1} s_1 \xrightarrow{t_2}
\cdots)\) be an execution. The trace of \(\alpha\) is the sequence of events
\(\mathsf{tr}(\alpha) = (t_1, n_1)(t_2, n_2)\cdots\), of length \(|\alpha|\),
where \(n_i\) is the number of \(j \in [1, i]\) for which
\(\mathsf{tid}(t_j) = \mathsf{tid}(t_i)\). We write \([\alpha]\) for the set
of events occurring in \(\mathsf{tr}(\alpha)\).

A trace labels the statements executed by a thread with consecutive integers
starting from 1. Note the cardinality of \([\alpha]\) is \(|\alpha|\), as no
two events in \(\mathsf{tr}(\alpha)\) are equal. Also, \([\alpha]\) is
invariant under transposition of two adjacent commuting transitions from
different threads.

Given an execution \(\alpha\), the happens-before relation of \(\alpha\),
denoted \(\mathsf{HB}(\alpha)\), is a binary relation on \([\alpha]\). It is
the transitive closure of the union of three relations:


1. the intra-thread order relation

\[
\{((t_1, n_1), (t_2, n_2)) \in [\alpha] \times [\alpha] \mid
  \mathsf{tid}(t_1) = \mathsf{tid}(t_2) \wedge n_1 < n_2\}.
\]

2. the release-acquire relation. Say \(\mathsf{tr}(\alpha) = e_1 e_2 \ldots\)
   and \(e_i = (t_i, n_i)\). Then \((e_i, e_j)\) is in the release-acquire
   relation if there is some \(l \in \mathsf{Lock}\) such that all of the
   following hold: (i) \(1 \le i < j \le |\alpha|\), (ii) \(t_i\) is a release
   statement on \(l\), (iii) \(t_j\) is an acquire statement on \(l\), and
   (iv) whenever \(i < k < j\), \(t_k\) is not an acquire statement on \(l\).

3. the barrier relation. For any \(e = (t, n) \in [\alpha]\), let
   \(i = \mathsf{tid}(t)\) and define

\[
\mathsf{epoch}(e) = |\{e' \in [\alpha] \mid e' = (\mathsf{exit}_i, j)
  \text{ for some } j \in [1, n]\}|,
\]

   the number of barrier exit events in thread \(i\) preceding or including
   \(e\). The barrier relation is

\[
\{(e, e') \in [\alpha] \times [\alpha] \mid
  \mathsf{epoch}(e) < \mathsf{epoch}(e')\}.
\]

Two events “race” when they conflict but are not ordered by happens-before:


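Definition 5. Let \(\alpha\) be an execution and \(e, e' \in [\alpha]\). Say
\(e = (t, n)\) and \(e' = (t', n')\). We say \(e\) and \(e'\) race in
\(\alpha\) if \((t, t') \in \mathsf{conflict}\) and neither \((e, e')\) nor
\((e', e)\) is in \(\mathsf{HB}(\alpha)\). The data race relation of
\(\alpha\) is the symmetric binary relation on \([\alpha]\)

\[
\mathsf{DR}(\alpha) = \{(e, e') \in [\alpha] \times [\alpha] \mid
  e \text{ and } e' \text{ race in } \alpha\}.
\]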
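The trace, happens-before and data race definitions can be computed directly
for a finite statement sequence. The following Python sketch is only an
illustration under stated assumptions: statements are encoded as tuples
("acquire", tid, lock), ("release", tid, lock), ("exit", tid) and
("nsync", tid, name); events are identified by their position in the trace;
the conflict relation is given as a set of unordered statement pairs. The
names PATH1 and CONFLICT are hypothetical.

```python
def tid(t):
    return t[1]


def trace(stmts):
    """tr(alpha): pair each statement with its per-thread sequence number."""
    count, events = {}, []
    for t in stmts:
        count[tid(t)] = count.get(tid(t), 0) + 1
        events.append((t, count[tid(t)]))
    return events


def happens_before(stmts):
    """Return (events, HB) where HB is a set of index pairs into the trace."""
    ev = trace(stmts)
    m = len(ev)
    exits, epoch = {}, []
    for t, _ in ev:                       # epoch(e): exits of tid(t) up to e
        if t[0] == "exit":
            exits[tid(t)] = exits.get(tid(t), 0) + 1
        epoch.append(exits.get(tid(t), 0))
    hb = set()
    for i in range(m):
        for j in range(m):
            (ti, ni), (tj, nj) = ev[i], ev[j]
            if tid(ti) == tid(tj) and ni < nj:          # intra-thread order
                hb.add((i, j))
            if epoch[i] < epoch[j]:                     # barrier relation
                hb.add((i, j))
            if (i < j and ti[0] == "release" and tj[0] == "acquire"
                    and ti[2] == tj[2]
                    and not any(ev[k][0][0] == "acquire" and ev[k][0][2] == ti[2]
                                for k in range(i + 1, j))):
                hb.add((i, j))                          # release-acquire
    for k in range(m):                                  # transitive closure
        for i in range(m):
            if (i, k) in hb:
                for j in range(m):
                    if (k, j) in hb:
                        hb.add((i, j))
    return ev, hb


def races(stmts, conflict_pairs):
    """Pairs of events that conflict and are unordered by happens-before."""
    ev, hb = happens_before(stmts)
    return [(ev[i], ev[j])
            for i in range(len(ev)) for j in range(i + 1, len(ev))
            if frozenset({ev[i][0], ev[j][0]}) in conflict_pairs
            and (i, j) not in hb and (j, i) not in hb]


# Two threads write x under two different locks: the writes are unordered.
CONFLICT = {frozenset({("nsync", 1, "x=1"), ("nsync", 2, "x=2")})}
PATH1 = [("acquire", 1, "l1"), ("nsync", 1, "x=1"), ("release", 1, "l1"),
         ("acquire", 2, "l2"), ("nsync", 2, "x=2"), ("release", 2, "l2")]
```

If both threads used the same lock, the release in thread 1 would be ordered
before the acquire in thread 2, and the two writes would no longer race.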


Now we turn to the problem of detecting data races. Our approach is to 
explore a modified state space. The usual state space is a directed graph with 
node set State and transitions for edges. We make two modifications. First, 
we add some “history” to the state. Specifically, each thread records the nsync 
statements it has executed since its last lock event or barrier exit. This set is 
checked against those of other threads for conflicts, just before it is emptied after 
its next lock event or barrier exit. The second change is a reduction: any state 
that has an enabled statement that is not a lock statement will have outgoing 
edges from only one thread in the modified graph. 

A well-known technical challenge with partial order reduction concerns cycles
in the reduced state space. We deal with this challenge by assuming that \(P\)
comes with some additional information. Specifically, for each \(i\), we are
given a set \(R_i\), with \(\mathsf{Release}_i \cup \mathsf{Acquire}_i
\subseteq R_i \subseteq \mathsf{Local}_i\), satisfying: any cycle in the local
graph of thread \(i\) has at least one node in \(R_i\). In general, the
smaller \(R_i\), the more effective the reduction. In many application
domains, there are no cycles in the local graphs, so one can take
\(R_i = \mathsf{Release}_i \cup \mathsf{Acquire}_i\). For example, standard
for loops in C, in which the loop variable is incremented by a fixed amount at
each iteration, do not introduce cycles, because the loop variable will take
on a new value at each iteration. For while loops, one may choose one node
from the loop body to be in \(R_i\). Goto statements may also introduce cycles
and could require additions to \(R_i\).


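Definition 6. The race-detecting state graph for \(P\) is the pair
\(G = (V, E)\), where

\[
V = \mathsf{State} \times \Bigl(\prod_{i \in \mathsf{TID}}
  2^{\mathsf{Stmt}_i}\Bigr)
\]

and \(E \subseteq V \times \mathsf{Stmt} \times V\) consists of all
\(((s, a), t, (s', a'))\) such that, letting \(\sigma_i\) be the local state
of thread \(i\) in \(s\),


1. \(s \xrightarrow{t} s'\) is a transition in \(P\)
2. for all \(i \in \mathsf{TID}\),

\[
a'_i =
\begin{cases}
a_i \cup \{t\} & \text{if } t \text{ is an nsync statement in thread } i \\
\emptyset & \text{if } t = \mathsf{exit}_0 \text{ or }
    (i = \mathsf{tid}(t) \wedge \sigma_i \in R_i) \\
a_i & \text{otherwise}
\end{cases}
\]

3. if there is some \(i \in \mathsf{TID}\) such that \(\sigma_i \notin R_i\)
   and thread \(i\) has an enabled statement at \(s\), then
   \(\mathsf{tid}(t)\) is the minimal such \(i\).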


The race-detecting state graph may be thought of as a directed graph in which
the nodes are \(V\) and edges are labeled by statements. Note that at a state
where all threads are in the barrier, \(\mathsf{exit}_0\) is the only enabled
statement in the race-detecting state graph, and its execution results in
emptying all the \(a_i\). A lock event in thread \(i\) results in emptying
\(a_i\) only.


Definition 7. Let \(P\) be a multithreaded program and \(G = (V, E)\) the
race-detecting state graph for \(P\).


1. Let \(u = (s, a) \in V\) and \(i \in \mathsf{TID}\). We say thread \(i\)
   detects a race in \(u\) if there exist
   \(j \in \mathsf{TID} \setminus \{i\}\), \(t_1 \in a_j\), and
   \(t_2 \in a_i\) such that \((t_1, t_2) \in \mathsf{conflict}\).

2. Let \(e = v \xrightarrow{t} v' \in E\), \(i = \mathsf{tid}(t)\),
   \(\sigma\) the local state of thread \(i\) at \(v\), and \(\sigma'\) the
   local state of thread \(i\) at \(v'\). We say \(e\) detects a race if
   either (i) \(\sigma \in R_i \setminus \mathsf{Acquire}_i\) and thread \(i\)
   detects a race in \(v\), (ii) \(\sigma' \in \mathsf{Acquire}_i\) and thread
   \(i\) detects a race in \(v'\), or (iii) \(t = \mathsf{exit}_0\) and any
   thread detects a race in \(v\).

3. We say \(G\) detects a race from \(u\) if \(E\) contains an edge that is
   reachable from \(u\) and detects a race, or there is some
   \(v = (s, a) \in V\) that is reachable from \(u\), and
   \(i \in \mathsf{TID}\), such that \(\mathsf{enabled}(s) = \emptyset\) and
   thread \(i\) detects a race in \(v\).


Definition 7 suggests a method for detecting data races in a multithreaded
program. The nodes and edges of the race-detecting state graph reachable from
an initial node are explored. (The order in which they are explored is
irrelevant.) When an edge from a thread at an
\(R_i \setminus \mathsf{Acquire}_i\) state is executed, the elements of
\(a_i\) are compared with those in \(a_j\) for all
\(j \in \mathsf{TID} \setminus \{i\}\) to see if a conflict exists, and if so,
a data race is reported. When an edge in thread \(i\) terminates at an
\(\mathsf{Acquire}_i\) state, a similar race check takes place. When an
\(\mathsf{exit}_0\) occurs, or a node with no outgoing edges is reached,
\(a_i\) and \(a_j\) are compared for all \(i, j \in \mathsf{TID}\) with
\(i \neq j\). This approach is sound and precise in the following sense:

Theorem 1. Let \(P\) be a multithreaded program, and \(G = (V, E)\) the
race-detecting state graph for \(P\). Let \(s_0 \in \mathsf{State}\) and let
\(u_0 = (s_0, \emptyset^{\mathsf{TID}}) \in V\). Assume the set of nodes
reachable from \(u_0\) is finite. Then


1. \(P\) has an execution from \(s_0\) with a data race if, and only if, \(G\)
   detects a race from \(u_0\).

2. If there is a data race-free execution of \(P\) from \(s_0\) to some state
   \(s_f\) with \(\mathsf{enabled}(s_f) = \emptyset\) then there is a path in
   \(G\) from \(u_0\) to a node with state component \(s_f\).


A proof of Theorem 1 is given in https://arxiv.org/abs/2305.18198. 

Example 1. Consider the 2-threaded program represented in pseudocode:


t1: acquire(l1); x=1; release(l1);


t2: acquire(l2); x=2; release(l2);


where l1 and l2 are distinct locks. Let
\(R_i = \mathsf{Release}_i \cup \mathsf{Acquire}_i\) (\(i = 1, 2\)). One path
in the race-detecting state graph \(G\) executes as follows:


acquire(l1); x=1; release(l1); acquire(l2); x=2; release(l2);


A data race occurs on this path since the two assignments conflict but are not
ordered by happens-before. The race is not detected, since at each lock
operation, the statement set in the other thread is empty. However, there is
another path


acquire(l1); x=1; acquire(l2); x=2; release(l1);


in G, and on this path the race is detected at the release.
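The two paths of Example 1 can be replayed mechanically. The Python sketch
below is an illustration only, restricted to straight-line, lock-only threads
with R_i equal to the acquire and release states. It assumes the given path is
a valid path of G (lock availability and the reduction rule are not checked).
The encoding of statements as ("acquire", lock), ("release", lock) and
("w", variable), and the names PROGS and replay, are hypothetical.

```python
def kind(prog, pc):
    """Kind of the local state reached when the next statement is prog[pc]."""
    if pc == len(prog):
        return "term"
    op = prog[pc][0]
    return op if op in ("acquire", "release") else "nsync"


def conflict(t1, t2):
    # accesses are (op, variable); conflict = same variable, at least one write
    return t1[1] == t2[1] and "w" in (t1[0], t2[0])


def detects(a, i):
    """Thread i detects a race: some statement in a[j], j != i, conflicts with a[i]."""
    return any(conflict(t1, t2)
               for j in a if j != i for t1 in a[j] for t2 in a[i])


def replay(progs, path):
    """Replay a path given as a list of thread IDs.

    Returns the 0-based index of the first step whose edge detects a race,
    len(path) if the race is detected at a final node, or None otherwise.
    """
    pc = {i: 0 for i in progs}
    a = {i: set() for i in progs}            # statement sets a_i
    for step, i in enumerate(path):
        src = kind(progs[i], pc[i])
        t = progs[i][pc[i]]
        if src == "release" and detects(a, i):   # sigma in R_i \ Acquire_i
            return step
        if src == "nsync":
            a[i].add(t)                          # record the nsync statement
        else:
            a[i] = set()                         # lock event empties a_i
        pc[i] += 1
        if kind(progs[i], pc[i]) == "acquire" and detects(a, i):
            return step                          # arrival at an acquire state
    if all(kind(progs[i], pc[i]) == "term" for i in progs) and \
            any(detects(a, i) for i in progs):
        return len(path)
    return None


PROGS = {
    1: [("acquire", "l1"), ("w", "x"), ("release", "l1")],
    2: [("acquire", "l2"), ("w", "x"), ("release", "l2")],
}
```

Here replay(PROGS, [1, 1, 1, 2, 2, 2]) corresponds to the first path of
Example 1 and replay(PROGS, [1, 1, 2, 2, 1]) to the second.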


3 Implementation and Evaluation 


We implemented a verification tool for C/OpenMP programs using the CIVL 
symbolic execution and model checking framework. This tool can be used to ver- 
ify absence of data races within bounds on certain program parameters, such as 
input sizes and the number of threads. (Bounds are necessary so that the num- 
ber of states is finite.) The tool accepts a C/OpenMP program and transforms 
it into CIVL-C, the intermediate verification language of CIVL. The CIVL-C 
program has a state space similar to the race-detecting state graph described 
in Sect. 2. The standard CIVL verifier, which uses model checking and symbolic 
execution techniques, is applied to the transformed code and reports whether 
the given program has a data race, and, if so, provides precise information on 
the variable involved in the race and an execution leading to the race. 

The approach is based on the theory of Sect. 2, but differs in some
implementation details. For example, in the theoretical approach, a thread
records the set of non-synchronization statements executed since the thread’s
last synchronization operation. This data is used only to determine whether a
conflict took place between two threads. Any type of data that can answer this
question would work equally well. In our implementation, each thread instead
records the set of memory locations read, and the set of memory locations
modified, since the last synchronization. A conflict occurs if the read or
write set of one thread intersects the write set of another thread. As CIVL-C
provides robust support for tracking memory accesses, this approach is
relatively straightforward to implement by a program transformation.
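The conflict test on read and write sets amounts to three intersection checks.
A minimal Python sketch, assuming memory locations are represented as hashable
values such as strings (the function name and representation are illustrative,
not part of CIVL):

```python
def sets_conflict(reads_i, writes_i, reads_j, writes_j):
    """True if threads i and j have a read-write or write-write overlap."""
    return bool(reads_i & writes_j      # i reads what j writes
                or writes_i & reads_j   # i writes what j reads
                or writes_i & writes_j) # both write the same location
```

Two threads that only read a common location do not conflict under this test,
which matches the usual requirement that at least one access is a write.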

In Sect. 3.1, we summarize the basics of OpenMP. In Sect. 3.2, we provide the 
necessary background on CIVL-C and the primitives used in the transformation. 
In Sect. 3.3, we describe the transformation itself. In Sect. 3.4, we report the 
results of experiments using this tool. 

All software and other artifacts necessary to reproduce the experiments, as 
well as the full results, are included in a VirtualBox virtual machine available at 
https://doi.org/10.5281/zenodo.7978348.


3.1 Background on OpenMP 


OpenMP is a pragma-based language for parallelizing programs written in C, 
C++ and Fortran [13]. OpenMP was originally designed and is still most com- 
monly used for shared-memory parallelization on CPUs, although the language 
is evolving and supports an increasing number of parallelization styles and hard- 
ware targets. We introduce here the OpenMP features that are currently sup- 
ported by our implementation in CIVL. An example that uses many of these 
features is shown in Fig. 1. 

The parallel construct declares the following structured block as a parallel 
region, which will be executed by all threads concurrently. Within such a parallel 
region, programmers can use worksharing constructs that cause certain parts of 
the code to be executed only by a subset of threads. Perhaps most importantly, 
the loop worksharing construct can be used inside a parallel region to declare 
an omp for loop whose iterations are mapped to different threads. The mapping
of iterations to threads can be controlled through the schedule clause, which
can take values including static, dynamic, and guided, along with an integer that
defines the chunk size. If no schedule is explicitly specified, the OpenMP run time 
is allowed to use an arbitrary mapping. Furthermore, a structured block within 
a worksharing loop may be declared as ordered, which will cause this block 
to be executed sequentially in order of the iterations of the worksharing loop. 
Worksharing for non-iterative workloads is supported through the sections con- 
struct, which allows the programmer to define a number of different structured 
blocks of code that will be executed in parallel by different threads. 

Programmers may use pragmas and clauses for barriers, atomic updates, 
and locks. OpenMP supports named critical sections, allowing no more than 
one thread at a time to enter a critical section with that name, and unnamed 
critical sections that are associated with the same global mutex. OpenMP also 
offers master and single constructs that are executed only by the master thread 
or one arbitrary thread.

#pragma omp parallel shared(b) private(i) shared(u,v)
{ // parallel region: all threads will execute this
  #pragma omp sections // section sharing construct
  {
    #pragma omp section // one thread will do this...
    { b=0; v=0; }
    #pragma omp section // while another thread does this...
    u = rand();
  }
  // loop sharing construct partitions iterations by schedule. Each thread has a
  // private copy of b; these are added back to original at end of loop...
  #pragma omp for reduction(+:b) schedule(dynamic,1)
  for (i=0; i<10; i++) {
    b = b + i;
    #pragma omp atomic seq_cst // atomic update to v
    v += i;
    #pragma omp critical (collatz) // one thread at a time enters critical section
    u = (u%2==0) ? u/2 : 3*u+1;
  }
}


Fig. 1. OpenMP Example


Variables are shared by all threads by default. Programmers may change 
the default, as well as the scope of individual variables, for each parallel region 
using the following clauses: private causes each thread to have its own vari- 
able instance, which is uninitialized at the start of the parallel region and sep- 
arate from the original variable that is visible outside the parallel region. The 
firstprivate scope declares a private variable that is initialized with the value 
of the original variable, whereas the lastprivate scope declares a private vari- 
able that is uninitialized, but whose final value is that of the logically last work- 
sharing loop iteration or lexically last section. The reduction clause initializes 
each instance to the neutral element, for example 0 for reduction (+). Instances 
are combined into the original variable in an implementation-defined order. 

CIVL can model OpenMP types and routines to query and control the number of
threads (omp_set_num_threads, omp_get_num_threads), get the current thread ID
(omp_get_thread_num), interact with locks (omp_init_lock, omp_destroy_lock,
omp_set_lock, omp_unset_lock), and obtain the current wall clock time
(omp_get_wtime).


3.2 Background on CIVL-C 


The CIVL framework includes a front-end for preprocessing, parsing, and build- 
ing an AST for a C program. It also provides an API for transforming the AST. 
We used this API to build a tool which consumes a C/OpenMP program and pro- 
duces a CIVL-C “model” of the program. The CIVL-C language includes most 
of sequential C, including functions, recursion, pointers, structs, and dynami- 
cally allocated memory. It adds nested function definitions and primitives for 
concurrency and verification. 
In CIVL-C, a thread is created by spawning a function: $spawn f(...); 

There is no special syntax for shared or thread-local variables; any variable that
is in scope for two threads is shared. CIVL-C uses an interleaving model of
concurrency similar to the formal model of Sect. 2. Simple statements, such as 
assignments, execute in one atomic step. 

Threads can synchronize using guarded commands, which have the form
$when (e) S. The first atomic substatement of S is guaranteed to execute only
from a state in which e evaluates to true. For example, assume thread IDs are
numbered from 0, and a lock value of -1 indicates the lock is free. The acquire
lock operation may be implemented as $when (l<0) l=tid;, where l is an integer
shared variable and tid is the thread ID. A release is simply l=-1;.
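The effect of the guard can be seen by enumerating all interleavings of a few
threads that each acquire and then release the lock. The Python sketch below
models only the guarded-command semantics described above; it is not CIVL-C
and does not use any CIVL API. The program-counter encoding (0 before the
acquire, 1 while holding the lock, 2 after the release) is an assumption made
for illustration.

```python
def explore(nthreads=2):
    """Explore all interleavings of: $when (l<0) l=tid; ... l=-1;"""
    FREE = -1
    init = ((0,) * nthreads, FREE)
    seen, stack = {init}, [init]
    max_holders = 0
    while stack:
        pcs, l = stack.pop()
        max_holders = max(max_holders, sum(pc == 1 for pc in pcs))
        for tid, pc in enumerate(pcs):
            if pc == 0 and l < 0:        # guard true: acquire, l = tid
                nxt = (pcs[:tid] + (1,) + pcs[tid + 1:], tid)
            elif pc == 1:                # release: l = -1
                nxt = (pcs[:tid] + (2,) + pcs[tid + 1:], FREE)
            else:                        # blocked on the guard, or finished
                continue
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen, max_holders
```

In every reachable state at most one thread holds the lock, because the
acquire step is enabled only in states where the guard l<0 holds.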

A convenient way to spawn a set of threads is $parfor (int i:d) S. This
spawns one thread for each element of the 1d-domain d; each thread executes S
with i bound to one element of the domain. A 1d-domain is just a set of integers;
e.g., if a and b are integer expressions, the domain expression a..b represents
the set {a,a+1,...,b}. The thread that invokes the $parfor is blocked until
all of the spawned threads terminate, at which point the spawned threads are
destroyed and the original thread proceeds.

CIVL-C provides primitives to constrain the interleaving semantics of a pro- 
gram. The program state has a single atomic lock, initially free. At any state, 
if there is a thread t that owns the atomic lock, only t is enabled. When the 
atomic lock is free, if there is some thread at a $local_start statement, and 
the first statement following $local_start is enabled, then among such threads, 
the thread with lowest ID is the only enabled thread; that thread executes 
$local_start and obtains the lock. When t invokes $local_end, t relinquishes 
the atomic lock. Intuitively, this specifies a block of code to be executed atomi- 
cally by one thread, and also declares that the block should be treated as a local 
statement, in the sense that it is not necessary to explore all interleavings from 
the state where the local is enabled. 

Local blocks can also be broken up at specified points using function $yield. 
If t owns the atomic lock and calls $yield, then t relinquishes the lock and does 
not immediately return from the call. When the atomic lock is free, there is no 
thread at a $local_start, a thread t is in a $yield, and the first statement 
following the $yield is enabled, then t may return from the $yield call and 
re-obtain the atomic lock. This mechanism can be used to implement the
race-detecting state graph: thread \(i\) begins with $local_start, yields at
each \(R_i\) node, and ends with $local_end.

CIVL’s standard library provides a number of additional primitives. For 
example, the concurrency library provides a barrier implementation through a 
type $barrier, and functions to initialize, destroy, and invoke the barrier. 

The mem library provides primitives for tracking the sets of memory locations
(a variable, an element of an array, a field of a struct, etc.) read or
modified through a region of code. The type $mem is an abstraction
representing a set of memory locations, or mem-set. The state of a CIVL-C
thread includes a stack of mem-sets for writes and a stack for reads. Both
stacks are initially empty. The function $write_set_push pushes a new empty
mem-set onto the write stack. At any point when a memory location is modified,
the location is added to the top entry on the write stack. Function
$write_set_pop pops the write stack, returning the top mem-set. The
corresponding functions for the read stack are $read_set_push and
$read_set_pop. The library also provides various operations on mem-sets, such
as $mem_disjoint, which consumes two mem-sets and returns true if the
intersection of the two mem-sets is empty.

int nthreads = ...;
$mem reads[nthreads], writes[nthreads];
void check_conflict(int i, int j) {
    $assert($mem_disjoint(reads[i], writes[j]) && $mem_disjoint(writes[i], reads[j]) &&
            $mem_disjoint(writes[i], writes[j]));
}
void check_and_clear_all() {
    for (int i=0; i<nthreads; i++)
        for (int j=i+1; j<nthreads; j++) check_conflict(i, j);
    for (int i=0; i<nthreads; i++) reads[i] = writes[i] = $mem_empty();
}
void run(int tid) {
    void pop() { reads[tid]=$read_set_pop(); writes[tid]=$write_set_pop(); }
    void push() { $read_set_push(); $write_set_push(); }
    void check() {
        for (int i=0; i<nthreads; i++) { if (i==tid) continue; check_conflict(tid, i); }
    }
    // local variable declarations
    $local_start(); push(); S pop(); $local_end();
}
for (int i=0; i<nthreads; i++) reads[i] = writes[i] = $mem_empty();
$parfor (int tid:0..nthreads-1) run(tid);
check_and_clear_all();


Fig. 2. Translation of #pragma omp parallel S
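The push/pop discipline of the mem-set stacks can be mimicked in a few lines.
The Python sketch below is a toy model and not the CIVL mem library: the class
MemTracker and the hooks on_read and on_write are hypothetical stand-ins for
the automatic tracking that CIVL-C performs, and memory locations are plain
strings.

```python
class MemTracker:
    """Toy model of a thread's read and write mem-set stacks."""

    def __init__(self):
        self.read_stack, self.write_stack = [], []   # both initially empty

    def read_set_push(self):
        self.read_stack.append(set())

    def write_set_push(self):
        self.write_stack.append(set())

    def read_set_pop(self):
        return self.read_stack.pop()

    def write_set_pop(self):
        return self.write_stack.pop()

    def on_read(self, loc):
        # a read is recorded in the top entry, if tracking is on
        if self.read_stack:
            self.read_stack[-1].add(loc)

    def on_write(self, loc):
        # a write is recorded in the top entry, if tracking is on
        if self.write_stack:
            self.write_stack[-1].add(loc)


def mem_disjoint(m1, m2):
    """True if the two mem-sets have an empty intersection."""
    return not (m1 & m2)
```

A thread that pushes, performs some accesses, and pops obtains exactly the
locations touched in between; accesses made while a stack is empty are not
recorded.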


3.3 Transformation for Data Race Detection 


The basic structure for the transformation of a parallel construct is shown in 
Fig. 2. The user specifies on the command line the default number of threads to 
use in a parallel region. After this, two shared arrays are allocated, one to record 
the read set for each thread, and the other the write set. Rather than updating 
these arrays immediately with each read and write event, a thread updates them 
only at specific points, in such a way that the shared sets are current whenever 
a data race check is performed. 

The auxiliary function check_conflict asserts no read-write or write-write 
conflict exists between threads 7 and j. Function check_and_clear_all checks 
that no conflict exists between any two threads and clears the shared mem-sets. 

Each thread executes function run. A local copy of each private variable is 
declared (and, for firstprivate variables, initialized) here. The body of this 
function is enclosed in a local region. The thread begins by pushing new entries 
onto its read and write stacks. As explained in Sect. 3.2, this turns on memory
access tracking. The body S is transformed in several ways. First, references to 
the private variable are replaced by references to the local copy. Other OpenMP 
constructs are translated as follows.

Lock operations. Several OpenMP operations are modeled using locks. The
omp_set_lock and omp_unset_lock functions are the obvious examples, but we
also use locks to model the behavior of atomic and critical section constructs. In
any case, a lock acquire operation is translated to


pop(); check(); $yield(); acquire(l); push();


The thread first pops its stacks, updating its shared mem-sets. At this point, the
shared structures are up-to-date, and the thread uses them to check for conflicts
with other threads. This conforms with Definition 7(2), that a race check occur
upon arrival at an acquire location. It then yields to other threads as it attempts
to acquire lock l. Once acquired, it pushes new empty entries onto its stack and
resumes tracking. A release statement becomes


pop(); $yield(); check(); release(l); push();


It is similar to the acquire case, except that the check occurs upon leaving the
release location, i.e., after the yield. A similar sequence is inserted in any loop
(e.g., a while loop or a for loop not in standard form) that may create a cycle
in the local space, only without the release statement.


Barriers. An explicit or implicit barrier in S becomes


pop(); $local_end(); $barrier_call(); if (tid==0) check_and_clear_all();
$barrier_call(); $local_start(); push();


The CIVL-C $barrier_call function must be invoked outside of a local region,
as it may block. Once all threads are in the barrier, a single thread (0) checks
for conflicts and clears all the shared mem-sets. A second barrier call is used
to prevent other threads from racing ahead before this check and clear is
complete. This protocol mimics the events that take place atomically with an
\(\mathsf{exit}_0\) transition in Sect. 2.
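The two-barrier protocol can be imitated with ordinary threads. The Python
sketch below uses threading.Barrier from the standard library in place of
$barrier_call. It is an analogy only: the access sets are supplied as inputs
rather than tracked, and the names run_barrier_protocol and accesses are
hypothetical.

```python
import threading


def run_barrier_protocol(nthreads, accesses):
    """accesses[tid] = (reads, writes) recorded by thread tid before the barrier.

    Returns (races, reads, writes): conflicting thread pairs found by thread 0,
    and the shared sets after the protocol (expected to be empty).
    """
    reads = [set() for _ in range(nthreads)]
    writes = [set() for _ in range(nthreads)]
    barrier = threading.Barrier(nthreads)
    races = []

    def check_and_clear_all():
        for i in range(nthreads):
            for j in range(i + 1, nthreads):
                if (reads[i] & writes[j] or writes[i] & reads[j]
                        or writes[i] & writes[j]):
                    races.append((i, j))
        for i in range(nthreads):
            reads[i], writes[i] = set(), set()

    def run(tid):
        r, w = accesses[tid]
        reads[tid], writes[tid] = set(r), set(w)  # pop(): publish local sets
        barrier.wait()            # first call: every thread has published
        if tid == 0:
            check_and_clear_all()
        barrier.wait()            # second call: nobody proceeds before the check
        # push(): access tracking would resume here

    threads = [threading.Thread(target=run, args=(t,)) for t in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return races, reads, writes
```

Without the second wait, a fast thread could publish accesses from the next
phase before thread 0 has finished checking and clearing the current one.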


Atomic and Critical Sections. An OpenMP atomic construct is modeled by 
introducing a global “atomic lock” which is acquired before executing the atomic 
statement and then released. The acquire and release actions are then trans- 
formed as described above. Similarly, a lock is introduced for each critical section 
name (and the anonymous critical section); this lock is acquired before entering 
a critical section with that name and released when departing. 


Worksharing Constructs. Upon arriving at a for construct, a thread invokes a 
function that returns the set of iterations for which the thread is responsible. 
The partitioning of the iteration space among the threads is controlled by the 
construct clauses and various command line options. If the construct specifies 
the distribution strategy precisely, then the model uses only that distribution. If 
the construct does not specify the distribution, then the decisions are based on 
command line options. One option is to explore all possible distributions. In this 
case, when the first thread arrives, a series of nondeterministic choices is made
to construct an arbitrary distribution. The verifier explores all possible choices,
and therefore all possible distributions. This enables a complete analysis of the 
loop’s execution space, but at the expense of a combinatorial explosion with 
the number of threads or iterations. A different command line option allows the 
user to specify a particular default distribution strategy, such as cyclic. These 
options give the user some control over the completeness-tractability tradeoff. 
For sections, only cyclic distribution is currently supported, and a single 
construct is executed by the first thread to arrive at the construct. 
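The tradeoff between a fixed distribution and exploring every distribution can
be made concrete. The Python sketch below rests on two assumptions that the
text does not spell out: that "cyclic" means round-robin assignment (iteration
k goes to thread k mod nthreads), and that "all possible distributions" means
every assignment of iterations to threads. CIVL's actual choices may be more
restricted.

```python
from itertools import product


def cyclic(n_iters, nthreads):
    """Round-robin distribution: thread t gets iterations t, t+nthreads, ..."""
    return [list(range(t, n_iters, nthreads)) for t in range(nthreads)]


def all_distributions(n_iters, nthreads):
    """Yield every assignment of iterations to threads (nthreads**n_iters many)."""
    for owners in product(range(nthreads), repeat=n_iters):
        yield [[k for k in range(n_iters) if owners[k] == t]
               for t in range(nthreads)]
```

Under these assumptions, 10 iterations on 8 threads already admit 8**10
assignments, while the cyclic strategy explores exactly one.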


3.4 Evaluation 


We applied our verifier to a suite comprising benchmarks from DataRaceBench
(DRB) version 1.3.2 [35] and some examples written by us that use different
concurrency patterns. As a basis for comparison, we applied a state-of-the-art
static analyzer for OpenMP race detection, LLOV v.0.3 [10], to the same suite.²

LLOV v.0.3 implements two static analyses. The first uses polyhedral anal- 
ysis to identify data races due to loop-carried dependencies within OpenMP 
parallel loops [9]. It is unable to identify data races involving critical sections, 
atomic operations, master or single directives, or barriers. The second is a phase 
interval analysis to identify statements or basic blocks (and consequently mem- 
ory accesses within those blocks) that may happen in parallel [10]. Phases are 
separated by explicit or implicit barriers and the minimum and maximum phase 
in which a statement or basic block may execute define the phase interval. The 
phase interval analysis errs in favor of reporting accesses as potentially happen- 
ing in parallel whenever it cannot prove that they do not; consequently, it may 
produce false alarms. 

The DRB suite exercises a wide array of OpenMP language features. Of the 
172 benchmarks, 88 use only the language primitives supported by our CIVL 
OpenMP transformer (see Sect. 3.1). Some of the main reasons benchmarks were 
excluded include: use of C++, simd and task directives, and directives for GPU 
programming. All 88 programs also use only features supported by LLOV. Of 
the 88, 47 have data races and 41 are labeled race-free. 

We executed CIVL on the 88 programs, with the default number of OpenMP 
threads for a parallel region bounded by 8 (with a few exceptions, described 
below). We chose cyclic distribution as the default for OpenMP for loops. Many 
of the programs consume positive integer inputs or have clear hard-coded inte- 
ger parameters. We manually instrumented 68 of the 88, inserting a few lines of 
CIVL-C code, protected by a preprocessor macro that is defined only when the 
program is verified by CIVL. This code allows each parameter to be specified on 
the CIVL command line, either as a single value or by specifying a range. In a few 
cases (e.g., DRB055), “magic numbers” such as 500 appear in multiple places, 


2 While there are a number of effective dynamic race detectors, the goal of those tools
is to detect races on a particular execution. Our goal is more aligned with that 
of static analyzers: to cover as many executions as possible, including for different 
inputs, number of threads, and thread interleavings. 
// DRB140 (race)                  // DRB014 (race)               // diffusion1 (race)
int a, i;                         int n=100, m=100;              double *u, *v;
#pragma omp parallel private(i)   double b[n][m];                // alloc, init u, v
{                                 #pragma omp parallel for \     for (t=0; t<steps; t++) {
#pragma omp master                    private(j)                 #pragma omp parallel for
  a=0;                            for (i=1;i<n;i++)                for (i=1; i<n-1; i++) {
#pragma omp for reduction(+:a)      for (j=0;j<m;j++)                u[i]=v[i]+c*(v[i-1]+v[i]);
  for (i=0; i<10; i++)                // out of bound access       }
    a=a+i;                            b[i][j]=b[i][j-1];           u=v; v=u;
}                                                                }


Fig. 3. Excerpts from three benchmarks with data races: two from DataRaceBench 
(left and middle) and erroneous 1d-diffusion (right). 


which we replaced with an input parameter controlled by CIVL. These modi- 
fications are consistent with the “small scope” approach to verification, which 
requires some manual effort to properly parameterize the program so that the 
“scope” can be controlled. 

We used the range 1..10 for inputs, again with a few exceptions. In three 
cases, verification did not complete within 3 min and we lowered these bounds as 
follows: for DRB043, thread bound 8 and input bound 4; for the Jacobi iteration 
kernel DRB058, thread bound 4 and bound of 5 on both the matrix size and 
number of iterations; for DRB062, thread bound 4 and input bound 5. 

CIVL correctly identified 40 of the 41 data-race-free programs, failing only 
on DRB139 due to nested parallel regions. It correctly reported a data race for 
45 of the 47 programs with data races, missing only DRB014 (Fig. 3, middle) and 
DRB015. In both cases, CIVL reports a bound issue for an access to b[i][j-1]
when i > 0 and j = 0, but fails to report a data race, even when bound checking 
is disabled. 

LLOV correctly identified 46 of the 47 programs with data races, failing to 
report a data race for DRB140 (Fig. 3, left). The semantics for reduction specify 
that the loop behaves as if each thread creates a private copy, initially 0, of 
the shared variable a, and updates this private copy in the loop body. At the 
end of the loop, the thread adds its local copy onto the original shared variable. 
These final additions are guaranteed to not race with each other. In CIVL, this is 
modeled using a lock. However, there is no guarantee that these updates do not 
race with other code. In this example, thread 0 could be executing the assignment 
a=0 while another thread is adding its local result to a—a data race. This race 
issue can be resolved by isolating the reduction loop with barriers. 
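The DRB140 scenario can be replayed in a small model. The sketch below is a simplification under stated assumptions, not CIVL's implementation: each thread's trace is split into barrier-delimited epochs, and two accesses conflict when they come from different threads, fall in the same epoch, touch the same variable, include a write, and are not protected by the same guard (here the hypothetical tag `"reduction"` stands for the lock that serializes the final additions).

```python
from itertools import combinations


def epochs(trace):
    """Split one thread's trace into barrier-delimited lists of accesses."""
    out = [[]]
    for ev in trace:
        if ev == "barrier":
            out.append([])
        else:
            out[-1].append(ev)
    return out


def has_race(traces):
    """traces: one list per thread of 'barrier' or (kind, var, guard) tuples."""
    per_thread = [epochs(t) for t in traces]
    n_epochs = min(len(e) for e in per_thread)
    for k in range(n_epochs):
        for t1, t2 in combinations(range(len(traces)), 2):
            for (k1, v1, g1) in per_thread[t1][k]:
                for (k2, v2, g2) in per_thread[t2][k]:
                    same_guard = g1 is not None and g1 == g2
                    if v1 == v2 and "w" in (k1, k2) and not same_guard:
                        return True
    return False


# Thread 0 runs the master assignment a=0, then its reduction update.
racy = [
    [("w", "a", None), ("w", "a", "reduction"), "barrier"],
    [("w", "a", "reduction"), "barrier"],
]
# Same program with a barrier isolating the reduction loop.
fixed = [
    [("w", "a", None), "barrier", ("w", "a", "reduction"), "barrier"],
    ["barrier", ("w", "a", "reduction"), "barrier"],
]
```

In this model `has_race(racy)` holds because the unguarded write `a=0` shares an epoch with another thread's reduction update, and `has_race(fixed)` does not.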

LLOV correctly identified 38 out of 41 data-race-free programs. It reported 
false alarms for DRB052 (no support for indirect addressing), DRB054 (failure 
to propagate array dimensions and loop bounds from a variable assignment), 
and DRB069 (failure to properly model OpenMP lock behavior). 
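The counts reported above can be tallied side by side. The following sketch only restates the numbers given in the text (47 racy and 41 race-free programs); the dictionary keys and the `summarize` helper are illustrative names.

```python
RACY, RACE_FREE = 47, 41  # composition of the 88 DRB programs

results = {
    "CIVL": {"races_found": 45, "race_free_confirmed": 40},
    "LLOV": {"races_found": 46, "race_free_confirmed": 38},
}


def summarize(r):
    correct = r["races_found"] + r["race_free_confirmed"]
    return {
        "correct": correct,
        "accuracy": correct / (RACY + RACE_FREE),
        "missed_races": RACY - r["races_found"],
        "race_free_misclassified": RACE_FREE - r["race_free_confirmed"],
    }


summary = {tool: summarize(r) for tool, r in results.items()}
```

On these figures CIVL is correct on 85 of 88 programs and LLOV on 84 of 88, so the two tools are close in accuracy on this subset, with complementary failure cases.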

The DRB suite contains few examples with interesting interleaving depen- 
dencies or pointer alias issues. To complement the suite, we wrote 10 additional 
C/OpenMP programs based on widely-used concurrency patterns (cf. [1]): 


— 3 implementations of a synchronization signal sent from one thread to 
another, using locks or busy-wait loops with critical sections or atomics; 
// atomic3 (no race)                    // bar1
int x=0, s=0;                           // initialize locks l0, l1;
#pragma omp parallel sections \         #pragma omp parallel num_threads(2)
    shared(x,s) num_threads(2)          {
{                                         int tid = omp_get_thread_num();
#pragma omp section                       if (tid == 0) omp_set_lock(&l0);
  {                                       else if (tid == 1) omp_set_lock(&l1);
    x=1;                                #pragma omp barrier
#pragma omp atomic write seq_cst          if (tid == 0) x=0;
    s=1;                                  if (tid == 0) {
  }                                         omp_unset_lock(&l0);
#pragma omp section                         omp_set_lock(&l1);
  {                                       } else if (tid == 1) {
    int done = 0;                           omp_set_lock(&l0);
    while (!done) {                         omp_unset_lock(&l1);
#pragma omp atomic read seq_cst           }
      done = s;                           if (tid == 1) x=1;
    }                                   #pragma omp barrier
    x=2;                                  if (tid == 0) omp_unset_lock(&l1);
  }                                       else if (tid == 1) omp_unset_lock(&l0);
}                                       }


Fig. 4. Code for synchronization using an atomic variable (left) and a 2-thread barrier 
using locks (right). 


— 3 implementations of a 2-thread barrier, using busy-wait loops or locks; 

— 2 implementations of a 1d-diffusion simulation, one in which two copies of the
main array are created by two separate malloc calls; one in which they are 
inside a single malloced object; and 

— an instance of a single-producer, single-consumer pattern; and a multiple- 
producer, multiple-consumer version, both using critical sections. 


For each program, we created an erroneous version with a data race, for a total 
of 20 tests. These codes are included in the experimental archive, and two are 
excerpted in Fig. 4. 

CIVL obtains the expected result in all 20. While we wrote these additional 
examples to verify that CIVL can reason correctly about programs with complex 
interleaving semantics or alias issues, for completeness we also evaluated them 
with LLOV. It should be noted, however, that the authors of LLOV warn that it 
“... does not provide support for the OpenMP constructs for synchronization...” 
and “...can produce False Positives for programs with explicit synchronizations 
with barriers and locks.” [9] It is therefore unsurprising that the results were 
somewhat mixed: LLOV produced no output for 6 of our examples (the racy 
and race-free versions of diffusion2 and the two producer-consumer codes) and 
produced the correct answer on 7 of the remaining 14. On these problems, LLOV
reported a race for both the racy and race-free version, with the exception of 
diffusion1 (Fig. 3, right), where a failure to detect the alias between u and v leads 
it to report both versions as race-free. 

CIVL’s verification time is significantly longer than LLOV’s. On the DRB 
benchmarks, total CIVL time for the 88 tests was 27 min. Individual times ranged 
from 1 to 150 seconds: 66 took less than 5s, 80 took less than 30s, and 82 took 
less than 1 min. (All CIVL runs used an M1 MacBook Pro with 16GB memory.) 
Total CIVL runtime on the 20 extra tests was 210s. LLOV analyzes all 88 DRB
problems in less than 15s (on a standard Linux machine). 


4 Related Work 


By Theorem 1, if barriers are the only form of synchronization used in a program, 
only a single interleaving will be explored, and this suffices to verify race-freedom 
or to find all states at the end of each barrier epoch. This is well known in other 
contexts, such as GPU kernel verification (cf. [5]). 

Prior work involving model checking and data races for unstructured con- 
currency includes Schemmel et al. [29]. This work describes a technique, using 
symbolic execution and POR, to detect defects in Pthreads programs. The app- 
roach involves intricate algorithms for enumerating configurations of prime event 
structures, each representing a set of executions. The completeness results deal 
with the detection of defects under the assumption that the program is race- 
free. While the implementation does check for data races, it is not clear that the 
theoretical results guarantee a race will be found if one exists. 

Earlier work of Elmas et al. describes a sound and precise technique for 
verifying race-freedom in finite-state lock-based programs [16]. It uses a bespoke 
POR-based model checking algorithm that associates significant and complex 
information with the state, including, for each shared memory location, a set of 
locks a thread should hold when accessing that location, and a reference to the 
node in the depth first search stack from which the last access to that location 
was performed. 

Both of these model checking approaches are considerably more complex than 
the approach of this paper. We have defined a simple state-transition system and 
shown that a program has a data race if and only if a state or edge satisfying 
a certain condition is reachable in that system. Our approach is agnostic to the 
choice of algorithm used to check reachability. The earlier approaches are also 
path-precise for race detection, i.e., for each execution path, a race is detected if 
and only if one exists on that path. As we saw in the example following Theorem 
1, our approach is not path-precise, nor does it have to be: to verify race-freedom, 
it is only necessary to find one race in one execution, if one exists. This partly 
explains the relative simplicity of our approach. 

A common approach for verifying race-freedom is to establish consistent 
correlation: for each shared memory location, there is some lock that is held 
whenever that location is accessed. LOCKSMITH [27] is a static analysis tool for 
multithreaded C programs that takes this approach. The approach should never 
report that a racy program is race-free, but can generate false alarms, since there 
are race-free programs that are not consistently correlated. False alarms can also 
arise from imprecise approximations of the set of shared variables, alias analysis, 
and so on. Nevertheless, the technique appears very effective in practice. 
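The consistent-correlation check can be sketched in a few lines. This is a deliberately simplified model, assuming the accesses and the locks held at each access are already known; LOCKSMITH derives this information by static analysis and is far more elaborate. The function names are hypothetical.

```python
def consistently_correlated(accesses):
    """accesses: iterable of (location, locks_held) pairs.

    Returns {location: locks held at every access to it}. An empty set means
    no single lock consistently guards the location, i.e. a potential race.
    """
    candidates = {}
    for loc, held in accesses:
        held = frozenset(held)
        candidates[loc] = candidates[loc] & held if loc in candidates else held
    return candidates


def potential_races(accesses):
    table = consistently_correlated(accesses)
    return sorted(loc for loc, locks in table.items() if not locks)


accesses = [
    ("x", {"m"}), ("x", {"m", "n"}),   # x is always accessed under lock m
    ("y", {"m"}), ("y", {"n"}),        # y has no common lock
]
```

Here `y` is flagged even if the program orders the two accesses by other means, such as a barrier, which illustrates how race-free programs that are not consistently correlated produce false alarms.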

Static analysis-based race-detection tools for OpenMP include OMPRacer 
[33]. OMPRacer constructs a static graph representation of the happens-before 
relation of a program and analyzes this graph, together with a novel whole- 
program pointer analysis and a lockset analysis, to detect races. It may miss 
violations as a consequence of unsound decisions that aim to improve performance
on real applications. The tool is not open source. The authors subsequently
released OpenRace [34], designed to be extensible to other parallelism
dialects; similar to OMPRacer, OpenRace may miss violations. Prior papers by 
the authors present details of static methods for race detection, without a tool 
that implements these methods [32]. 

PolyOMP [12] is a static tool that uses a polyhedral model adapted for a 
subset of OpenMP. Like most polyhedral approaches, it works best for affine 
loops and is precise in such cases. The tool additionally supports may-write 
access relations for non-affine loops, but may report false alarms in that case. 
DRACO [36] also uses a polyhedral model and has similar drawbacks. 

Hybrid static and dynamic tools include Dynamatic [14], which is based on 
LLVM. It combines a static tool that finds candidate races, which are subse- 
quently confirmed with a dynamic tool. Dynamatic may report false alarms and 
miss violations. 

ARCHER [2] is a tool that statically determines many sequential or prov- 
ably non-racy code sections and excludes them from dynamic analysis, then 
uses TSan [30] for dynamic race detection. To avoid false alarms, ARCHER 
also encodes information about OpenMP barriers that are otherwise not under- 
stood by TSan. A follow-up paper discusses the use of the OMPT interface 
to aid dynamic race detection tools in correctly identifying issues in OpenMP 
programs [28], as well as SWORD [3], a dynamic tool that can stay within user- 
defined memory bounds when tracking races, by capturing a summary on disk 
for later analysis. 

ROMP [18] is a dynamic/static tool that instruments executables using the 
DynInst library to add checks for each memory access and uses the OMPT inter- 
face at runtime. It claims to support all of OpenMP except target and simd con- 
structs, and models “logical” races even if they are not triggered because the con- 
flicting accesses happen to be scheduled on the same thread. Other approaches 
for dynamic race detection and tricks for memory and run-time efficient race 
bookkeeping during execution are described in [11, 19, 20, 24].

Deductive verification approaches have also been applied to OpenMP pro- 
grams. An example is [6], which introduces an intermediate parallel language and 
a specification language based on permission-based separation logic. C programs 
that use a subset of OpenMP are manually annotated with “iteration contracts” 
and then automatically translated into the intermediate form and verified using 
VerCors and Viper. Successfully verified programs are guaranteed to be race-free. 
While these approaches require more work from the user, they do not require 
bounding the number of threads or other parameters. 


5 Conclusion 


In this paper, we introduced a simple model-checking technique to verify that a 
program is free from data races. The essential ideas are (1) each thread “remem- 
bers” the accesses it performed since its last synchronization operation, (2) a 
partial order reduction scheme is used that treats all memory accesses as local,
and (3) checks for conflicting accesses are performed around synchronizations. 
We proved our technique is sound and precise for finite-state models, using a 
simple mathematical model for multithreaded programs with locks and barriers. 
We implemented our technique in a prototype tool based on the CIVL symbolic 
execution and model checking platform and applied it to a suite of C/OpenMP 
programs from DataRaceBench. Although based on completely different tech- 
niques, our tool achieved performance comparable to that of the state-of-the-art 
static analysis tool, LLOV v.0.3. 

Limitations of our tool include incomplete coverage of the OpenMP speci- 
fication (e.g., target, simd, and task directives are not supported); the need 
for some manual instrumentation; the potential for state explosion necessitat- 
ing small scopes; and a combinatorial explosion in the mappings of threads to 
loop iterations, OpenMP sections, or single constructs. In the last case, we have 
compromised soundness by selecting one mapping, but in future work we will 
explore ways to efficiently cover this space. On the other hand, in contrast to 
LLOV and because of the reliance on model checking and symbolic execution, 
we were able to verify the presence or absence of data races even for programs 
using unstructured synchronization with locks, critical sections, and atomics, 
including barrier algorithms and producer-consumer code. 
Abstract. IC3/PDR and its variants have been the prominent
approaches to safety model checking in recent years. Compared to the 
previous model-checking algorithms like BMC (Bounded Model Check- 
ing) and IMC (Interpolation Model Checking), IC3/PDR is attractive due 
to its completeness (vs. BMC) and scalability (vs. IMC). IC3/PDR main- 
tains an over-approximate state sequence for proving the correctness. 
Although the sequence refinement methodology is known to be crucial 
for performance, the literature lacks a systematic analysis of the problem.
We propose an approach based on the definition of i-good lemmas,
and the introduction of two kinds of heuristics, i.e., branching and refer- 
skipping, to steer the search towards the construction of i-good lemmas. 
The approach is applicable to IC3 and its variant CAR (Complementary 
Approximate Reachability), and it is very easy to integrate within exist- 
ing systems. We implemented the heuristics into two open-source model 
checkers, IC3Ref and SimpleCAR, as well as into the mature nuXmv plat- 
form, and carried out an extensive experimental evaluation on HWMCC 
benchmarks. The results show that the proposed heuristics can effec- 
tively compute more i-good lemmas, and thus improve the performance 
of all the above checkers. 


1 Introduction 


Safety model checking is a fundamental problem in verification. The goal is to 
prove that all the reachable states of the transition system $(I, T)$ satisfy a property
P. The field has been dominated by SAT-based techniques since the introduction
of Bounded Model Checking (BMC) [9]. The first wave of SAT-based
model-checking algorithms, including BMC, k-induction [31] and Interpolation- 
based Model Checking [25] have been superseded by the research deriving 
from the seminal work of Bradley [11]. The IC3 algorithm maintains an over- 
approximate state sequence for proving the correctness; it avoids unrolling the 
transition relation by localizing reasoning to frames, used to incrementally build 
an inductive invariant by discovering inductive clauses. 
IC3 (also known as PDR [17]) has spawned several variants, including those
that attempt to combine forward and backward search [29]. Particularly relevant 
in this paper is CAR (Complementary Approximate Reachability), which com- 
bines the forward overapproximation with a backward underapproximation [23]. 

It has been noted that different ways to refine the over-approximating sequence 
can impact the performance of the algorithm. For example, [21] attempts to dis- 
cover good lemmas, that can be “pushed to the top” since they are inductive. In 
this paper, we propose an alternative way to drive the refinement of the over-approximating
sequence. We identify i-good lemmas, i.e. lemmas that are inductive
with respect to the i-th overapproximating level. The intuition is that such
i-good lemmas are useful in the search since they are fundamental to reach a fix 
point in the safe case. In order to guide the search towards the discovery of i-good 
lemmas, we propose a heuristic approach based on two key insights, i.e., branching 
and refer-skipping. First, with branching we try to control the way the SAT solver 
extracts unsatisfiable cores by privileging variables occurring in i-good lemmas.
Second, we control lemma generalization by avoiding dropping literals occurring 
in a subsuming lemma in the previous layer (refer-skipping). 

The proposed approach is applicable both to IC3/PDR and CAR, and it is 
very simple to implement. Yet, it appears to be quite effective in practice. We 
implemented the i-good lemma heuristics in two open-source implementations 
of IC3 and CAR, and also in the mature, state-of-the-art IC3 implementation 
available inside the nuXmv model checker [12], and we carried out an extensive 
experimental evaluation on Hardware Model Checking Competition (HWMCC) 
benchmarks. Analysis of the results suggests that increasing the ratio of i-good 
lemmas leads to an increase in performance, and the heuristics appear to be quite 
effective in driving the search towards i-good lemmas. In terms of performance, 
this results in significant improvements for all the tools when equipped with the 
proposed approach. 

This paper is structured as follows. In Sect. 2 we present the problem and the 
IC3/PDR and CAR algorithms. In Sect. 3 we present the intuition underlying i- 
good lemmas and the algorithms to find them. In Sect. 4 we overview the related 
work. In Sect. 5 we present the experimental evaluation. In Sect. 6 we draw some 
conclusions and present directions for future work. 


2 Preliminaries 


2.1 Boolean Transition System 


A Boolean transition system Sys is a tuple (X,Y,I,T), where X and X’ denote 
the set of state variables in the present state and the next state, respectively, 
and Y denotes the set of input variables. The state space of Sys is the set of 
possible assignments to X. I(X) is a Boolean formula corresponding to the set 
of initial states, and T(X, Y, X’) is a Boolean formula representing the transition 
relation. State $s_2$ is a successor of state $s_1$ with input $y$ iff $s_1 \land y \land s_2' \models T$, which
is also denoted by $(s_1, y, s_2) \in T$. In the following, we will also write $(s_1, s_2) \in T$
meaning that $(s_1, y, s_2) \in T$ for some assignment $y$ to the input variables. A path


290 Y. Xia et al. 


of length k is a finite state sequence $s_1, s_2, \ldots, s_k$, where $(s_i, s_{i+1}) \in T$ holds for
$1 \le i \le k-1$. A state t is reachable from s in k steps if there is a path of length
k from s to t. Let S be a set of states in Sys. We overload T and denote the
set of successors of states in S as $T(S) = \{t \mid (s,t) \in T, s \in S\}$. Conversely, we
define the set of predecessors of states in S as $T^{-1}(S) = \{s \mid (s,t) \in T, t \in S\}$.
Recursively, we define $T^0(S) = S$ and $T^{i+1}(S) = T(T^i(S))$ where $i \ge 0$; the
notation $T^{-i}(S)$ is defined analogously. In short, $T^i(S)$ denotes the states that
are reachable from S in i steps, and $T^{-i}(S)$ denotes the states that can reach S
in i steps.
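The successor and predecessor operators can be made concrete on an explicit-state toy system. The sketch below assumes the transition relation is given as a finite set of pairs (s, t) with inputs already quantified away; the function names `image`, `preimage`, and `iterate` are illustrative.

```python
def image(T, S):
    """T(S): all successors of states in S."""
    return {t for (s, t) in T if s in S}


def preimage(T, S):
    """T^{-1}(S): all predecessors of states in S."""
    return {s for (s, t) in T if t in S}


def iterate(step, T, S, i):
    """Apply `step` i times, giving T^i(S) or T^{-i}(S)."""
    for _ in range(i):
        S = step(T, S)
    return S


# A four-state example: 0 -> 1 -> 2 -> 2, and 3 -> 0.
T = {(0, 1), (1, 2), (2, 2), (3, 0)}
```

For this system, `iterate(image, T, {0}, 2)` is `{2}` and `iterate(preimage, T, {2}, 2)` is `{0, 1, 2}`.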


2.2 Safety Checking and Reachability Analysis 


Given a transition system Sys = (X,Y,I,T) and a safety property P, which
is a Boolean formula over X, a model checker either proves that P holds for
any state reachable from an initial state in I, or disproves P by producing a
counterexample. In the former case, we say that the system is safe, while in the
latter case, it is unsafe. A counterexample is a finite path from an initial state
s to a state t violating P, i.e., $t \in \neg P$, and such a state is called a bad state.
In symbolic model checking, safety checking is reduced to symbolic reachability
analysis. Reachability analysis can be performed in a forward or backward
search. Forward search starts from initial states I and searches for bad states
by computing $T^i(I)$ with increasing values of i, while backward search begins
with states in $\neg P$ and searches for initial states by computing $T^{-i}(\neg P)$ with
increasing values of i. Table 1 gives the corresponding formal definitions.


Table 1. Exact reachability analysis. 


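               Forward                                            Backward
Base           $F_0 = I$                                          $B_0 = \neg P$
Induction      $F_{i+1} = T(F_i)$                                 $B_{i+1} = T^{-1}(B_i)$
Safe Check     $F_{i+1} \subseteq \bigcup_{0 \le j \le i} F_j$    $B_{i+1} \subseteq \bigcup_{0 \le j \le i} B_j$
Unsafe Check   $F_i \cap \neg P \neq \emptyset$                   $B_i \cap I \neq \emptyset$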


For forward search, $F_i$ denotes the set of states that are reachable from I
within i steps, which is computed by iteratively applying T. At each iteration,
we first compute a new $F_i$, and then perform safe checking and unsafe checking. If
the safe/unsafe checking hits, the search terminates. Intuitively, unsafe checking
$F_i \cap \neg P \neq \emptyset$ indicates some bad states are within $F_i$ and safe checking
$F_{i+1} \subseteq \bigcup_{0 \le j \le i} F_j$ indicates that all reachable states from I have been checked and none
of them violate P. For backward search, $B_i$ is the set of states that can reach
$\neg P$ in i steps, and the search procedure is analogous to the forward one.
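The forward column of Table 1 translates directly into an explicit-state loop. The sketch below is only an illustration on finite sets of states (real model checkers work symbolically); `forward_check` and its return convention are assumptions of this example.

```python
def image(T, S):
    """T(S): all successors of states in S."""
    return {t for (s, t) in T if s in S}


def forward_check(I, T, bad):
    """Exact forward reachability following Table 1.

    Returns ("unsafe", i) if F_i intersects the bad states, or ("safe", i)
    if F_{i+1} is contained in the union of F_0..F_i.
    """
    F = set(I)          # F_0 = I
    seen = set()        # union of all frames checked so far
    i = 0
    while True:
        if F & bad:                 # unsafe check
            return "unsafe", i
        seen |= F
        nxt = image(T, F)           # F_{i+1} = T(F_i)
        if nxt <= seen:             # safe check
            return "safe", i
        F, i = nxt, i + 1


# 0 -> 1 -> 2 -> 0 is a cycle; state 4 is only reachable from 3.
T = {(0, 1), (1, 2), (2, 0), (3, 4)}
```

On a finite system the loop terminates, because every iteration that does not return adds at least one new state to `seen`.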


Notations. A literal is an atomic variable or its negation. If l is a literal, we
denote its corresponding variable with var(l). A cube (resp. clause) is a conjunction
(resp. disjunction) of literals. The negation of a clause is a cube and vice
versa. A formula in Conjunctive Normal Form (CNF) is a conjunction of clauses.
For simplicity, we also treat a CNF formula $\phi$ as a set of clauses and make no
distinction between the formula and its set representation. Similarly, a cube or a
clause c can be treated as a set of literals or a Boolean formula, depending on
the context.

We say a CNF formula $\phi$ is satisfiable if there exists an assignment of its
Boolean variables, called a model, that makes $\phi$ true; otherwise, $\phi$ is unsatisfiable.
A SAT solver is a tool that can decide the satisfiability of a CNF formula $\phi$. In
addition to providing a yes/no answer, modern SAT solvers can also produce 
models for satisfiable formulas, and unsatisfiable cores (UC), i.e. a reason for 
unsatisfiability, for unsatisfiable ones. More precisely, in the following we shall 
assume to have a SAT solver that supports the following API (which is standard 
in state-of-the-art SAT solvers based on the CDCL algorithm [24]): 


— is_SAT($\phi$, A) checks the satisfiability of $\phi$ under the given assumptions A,
which is a list of literals. This is logically equivalent to checking the satisfiability
of $\phi \land \bigwedge A$, but is typically more efficient;

— get_UC() retrieves a UC of the assumption literals of the previous SAT call
when the formula $\phi \land \bigwedge A$ is unsatisfiable. That is, the result is a set $uc \subseteq A$
such that $\phi \land \bigwedge uc$ is unsatisfiable;

— get_model() retrieves the model of the formula $\phi \land \bigwedge A$ of the previous SAT
call, if the formula is satisfiable.
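To make the three calls concrete, here is a brute-force stand-in for such an interface. It is a toy under stated assumptions: literals are non-zero integers in DIMACS style, satisfiability is decided by enumeration, and the core is obtained by greedily dropping assumptions, whereas CDCL solvers derive a core from the conflict analysis. The class name `ToySolver` is hypothetical.

```python
from itertools import product


class ToySolver:
    """Enumeration-based illustration of is_SAT / get_UC / get_model."""

    def __init__(self):
        self._model = None
        self._core = None

    def is_SAT(self, cnf, assumptions):
        """cnf: list of clauses (lists of ints); assumptions: list of ints."""
        self._model = self._find_model(cnf, assumptions)
        self._core = None
        if self._model is None:
            # Shrink the assumption list while the query stays unsatisfiable.
            core = list(assumptions)
            for lit in list(assumptions):
                trial = [l for l in core if l != lit]
                if self._find_model(cnf, trial) is None:
                    core = trial
            self._core = core
        return self._model is not None

    def get_UC(self):
        return self._core

    def get_model(self):
        return self._model

    @staticmethod
    def _find_model(cnf, assumptions):
        variables = sorted({abs(l) for c in cnf for l in c}
                           | {abs(l) for l in assumptions})
        for bits in product([False, True], repeat=len(variables)):
            val = dict(zip(variables, bits))
            holds = lambda l: val[abs(l)] == (l > 0)
            if (all(holds(l) for l in assumptions)
                    and all(any(holds(l) for l in c) for c in cnf)):
                return val
        return None


solver = ToySolver()
cnf = [[1, 2], [-1, 3]]                      # (x1 or x2) and (not x1 or x3)
unsat_query = solver.is_SAT(cnf, [-2, -3, 4])
core = solver.get_UC()
sat_query = solver.is_SAT(cnf, [-2])
model = solver.get_model()
```

In the first query the assumption on variable 4 is irrelevant to the conflict, so the returned core contains only the literals for variables 2 and 3.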


2.3 Overview of IC3 and CAR 


IC3 is a SAT-based and complete safety model checking algorithm proposed
in [11], which only needs to unroll the system at most once. PDR [17] is a
re-implementation of IC3 which optimizes the original version in different aspects.
To prove the correctness of a given system Sys = (X,Y,I,T) w.r.t. the safety
property P, IC3/PDR maintains a monotone over-approximate state sequence O
such that (1) $O_0 = I$ and (2) $O_{i+1} \supseteq O_i \cup T(O_i)$ for $i \ge 0$. From the perspective
of reachability analysis, IC3 performs as shown in the left part of Table 2. Since
O is monotone, the state search can converge as soon as $O_{i+1} = O_i$ holds for
some $i \ge 0$. Otherwise, a state path (counterexample) starting from I to some
state in $\neg P$ can be detected ($T^{-i}(\neg P) \cap I \neq \emptyset$).


Table 2. A high-level description of IC3 (left) and (Forward) CAR (right). 


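               Over-approximate                        Under-approximate                                            Over-approximate                                              Under-approximate
Base           $O_0 = I$                               -                                             Base           $O_0 = I$                                                     $U_0 = \neg P$
Induction      $O_{i+1} \supseteq O_i \cup T(O_i)$     -                                             Induction      $O_{i+1} \supseteq T(O_i)$                                    $U_{i+1} \subseteq T^{-1}(U_i)$
Safe Check     $\exists i \cdot O_{i+1} = O_i$         -                                             Safe Check     $\exists i \cdot O_{i+1} \subseteq \bigcup_{0 \le j \le i} O_j$   -
Unsafe Check   -                                       $\exists i \cdot T^{-i}(\neg P) \cap I \neq \emptyset$   Unsafe Check   -                                                             $\exists i \cdot U_i \cap I \neq \emptyset$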


CAR [23] is a recently proposed algorithm, which can be considered as a 
general version of IC3. CAR differs from IC3 in the following main points:
Algorithm 1. Overview of IC3

 1: procedure IC3(I, T, P)
 2:   if is_SAT($I \land \neg P$) then  // unsafe check of initial state
 3:     return unsafe
 4:   $O_0 := I$, $k := 1$, $O_k := \top$
 5:   while true do
 6:     while is_SAT($O_k \land \neg P$) do
 7:       s := get_model()  // $s \models O_k \land \neg P$
 8:       if UNSAFECHECK(s, k - 1) then
 9:         return unsafe  // counterexample found
10:     $k := k + 1$, $O_k := \top$
11:     if SAFECHECK(k) then
12:       return safe  // property proved
13:
14: function UNSAFECHECK(s, i)
15:   while is_SAT($O_i \land \neg s \land T$, $s'$) do
16:     if i = 0 then
17:       return true
18:     t := GET_PREDECESSOR(s, i)  // $(t, s) \in T$, see Algorithm 4
19:     if UNSAFECHECK(t, i - 1) then
20:       return true
21:   c := GENERALIZE($\{l \mid l' \in$ get_UC()$\}$, i)  // $c \subseteq s$, see Algorithm 3
22:   $O_j := O_j \cap \neg c$, $1 \le j \le i + 1$
23:   return false
24:
25: function SAFECHECK(k)
26:   PROPAGATION(k)  // see Algorithm 4
27:   i := 0
28:   while i < k do
29:     if $O_i = O_{i+1}$ then
30:       return true
31:   return false

— The over-approximate state sequence O in CAR is not necessarily monotone.
Therefore, CAR has to apply the standard invariant-checking approach, i.e.,
finding a position $i \ge 0$ such that $O_{i+1} \subseteq \bigcup_{0 \le j \le i} O_j$ holds, as shown in the
right part of Table 2.

— Besides the O sequence, CAR also maintains an under-approximate state
sequence U that stores reachable (real) states from $\neg P$, see Table 2. The
motivation to introduce the U sequence is to re-use the intermediate states
that are computed during proving. Although it is straightforward for IC3 to
introduce such a sequence, the effect on the performance remains unknown.

— CAR can be performed in both forward mode, i.e., proving from I while searching
states back from $\neg P$, and backward mode, i.e., proving back from $\neg P$ while
searching states from I. Although Backward CAR is not good at proving, it is
advantageous in finding bugs, i.e., checking unsafety [16,22]. Relevant work on
reverse IC3/PDR [28], which corresponds to Backward CAR, has been studied
but the results did not clearly support its advantage on bug-finding.


An overview of IC3 and (forward) CAR is shown in Algorithm 1 and Algorithm 2,
respectively. At a high level, both algorithms have a similar structure,
consisting of an alternation of two phases: unsafe check and safe check. The
unsafe check (line 14 of Algorithm 1, line 14 of Algorithm 2) tries to find a state
sequence that is a path between I and $\neg P$; if such a sequence can be found,
then it is a counterexample witnessing the violation of P; otherwise, the $O_i$ are
Algorithm 2. Overview of CAR

 1: procedure CAR-FORWARD(I, T, P)
 2:   if is_SAT($I \land \neg P$) then  // unsafe check of initial state
 3:     return unsafe
 4:   $O_0 := I$, $U := \{\neg P\}$, $k := 0$
 5:   while true do
 6:     while is_SAT(U) do
 7:       s := get_model()  // $s \in U$
 8:       if UNSAFECHECK(s, k) then
 9:         return unsafe  // counterexample found
10:     if SAFECHECK(k) then
11:       return safe  // property proved
12:     $k := k + 1$, $O_k := \top$
13:
14: function UNSAFECHECK(s, i)
15:   while is_SAT($O_i \land T$, $s'$) do
16:     if i = 0 then
17:       return true
18:     t := GET_PREDECESSOR(s, i)  // $(t, s) \in T$, see Algorithm 5
19:     $U := U \cup \{t\}$
20:     if UNSAFECHECK(t, i - 1) then
21:       return true
22:   c := GENERALIZE($\{l \mid l' \in$ get_UC()$\}$, i)  // $c \subseteq s$, see Algorithm 3
23:   $O_{i+1} := O_{i+1} \cap \neg c$
24:   return false
25:
26: function SAFECHECK(k)
27:   PROPAGATION(k)  // see Algorithm 5
28:   i := 0
29:   while i < k do
30:     if not is_SAT($O_{i+1} \land \neg(\bigvee_{0 \le j \le i} O_j)$) then
31:       return true
32:   return false


strengthened with additional clauses until $O_i$ is strong enough to imply P.¹ The
safe check (line 25 of Algorithm 1, line 26 of Algorithm 2) tries to propagate the
clauses in $O_i$ to $O_{i+1}$ and checks whether a fixpoint is reached. If so, the algorithm
terminates. Both algorithms make use of similar additional procedures, which
will be detailed in the following section, when we introduce our novel heuristics.


3 Finding i-Good Lemmas


In this section, we introduce the concept of i-good lemmas, define the heuristics to 
steer the search towards i-good lemmas and describe the IC3 and CAR algorithms 
enhanced with i-good lemmas. For the sake of convenient description, we fix the 
input system Sys = (X,Y, I, T) and the property P to be verified. In describing 
the implementation of our heuristics, we shall necessarily assume that the reader 
has some familiarity with the low-level details of IC3 and CAR, for which we 
refer to [11,17,23]. Specifically, we shall use pseudo-code descriptions of the main 
components of the algorithms (Algorithm 3, 4, and 5), in which the modifications 
required to implement our heuristics are highlighted in blue. 


1 Note that in the unsafe check, the meaning of the SAT query is_SAT($O_i \land T$, $s'$) is
different between CAR and IC3 (line 15 of Algorithm 2), so that when it is unsatisfiable
the obtained clauses have different semantics.
3.1 What Are i-good Lemmas


The over-approximate state sequence O in IC3 (resp. CAR) is a finite sequence, in
which every element $O_i$ ($0 \le i < |O|$), namely frame i, is an over-approximation
of the states of the system that are reachable in up to (resp. exactly) i steps
from I, and which is strong enough to imply P. Such a sequence O has the form
$P \land C$, where C is a CNF, and each clause in C is called a lemma. For both
algorithms, the goal is that of transforming the sequence O to construct an
over-approximation of all the reachable states of the system (over an unbounded
horizon) that still implies P. When this happens, such over-approximation is an
inductive invariant that proves P. The key idea, common to both IC3 and
CAR, is to construct the invariant incrementally and by reasoning in a localized
manner, by (i) considering increasingly-long sequences of overapproximations,
and by (ii) trying to propagate forward individual lemmas from a frame $O_i$ to
its successor $O_{i+1}$, until a fixpoint is reached.² The forward propagation procedure
is crucial for ensuring the convergence of the algorithm in practice: for IC3
(resp. CAR), it checks whether a lemma c at frame i represents also an
overapproximation of all the states reachable in up to (resp. exactly) i + 1 steps, and
therefore can be added to frame i + 1. It is immediate to see that the successful
propagation of all lemmas from i to i + 1, for some i, is a sufficient condition for
the termination of both IC3 and CAR with a safe result. In fact, for IC3, this is
also a necessary condition.
We now introduce the notion of i-good lemma. 


Definition 1 (i-Good Lemma). Let c be a lemma that was added at frame
i by IC3/CAR (at some previous step in the execution of the algorithm), i.e.
O_i ⊨ c. We say that c is i-good if c now holds also at frame i + 1, i.e. O_{i+1} ⊨ c.


The following theorems are then consequences of the definition. 


Theorem 1. IC3 terminates with safe at frame i (i > 0), if and only if every 
lemma at frame i is i-good. 


Theorem 2. CAR terminates with safe at frame i (i > 0), if every lemma at 
frame i is i-good. 


Such theorems provide the theoretical foundation on which we base our main
conjecture: the computation of i-good lemmas can be helpful for both IC3 and
CAR to accelerate the convergence in proving properties. Intuitively, an i-good
lemma shows the promise of being independent of the reachability layer, and
hence of holding in general.
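
Definition 1 and the condition used in Theorems 1 and 2 can be tried out on tiny instances. The following Python sketch is an illustration only, not the implementation used in IC3 or CAR: it assumes that frames are given as lists of clauses over variables 1..n (integer literals, negative for negation) and it decides whether a frame entails a clause by brute-force enumeration instead of a SAT call. The names entails, is_i_good and all_lemmas_good are hypothetical.

```python
from itertools import product


def entails(cnf, clause, n_vars):
    """Return True if every assignment satisfying `cnf` also satisfies `clause`."""
    for bits in product([False, True], repeat=n_vars):
        def holds(lit):
            return bits[abs(lit) - 1] == (lit > 0)
        cnf_true = all(any(holds(l) for l in cl) for cl in cnf)
        if cnf_true and not any(holds(l) for l in clause):
            return False  # found a model of the frame that falsifies the clause
    return True


def is_i_good(frames, i, clause, n_vars):
    """A lemma of frame i is i-good if it also holds at frame i + 1."""
    return entails(frames[i + 1], clause, n_vars)


def all_lemmas_good(frames, i, n_vars):
    """Condition of Theorems 1 and 2: every lemma at frame i is i-good."""
    return all(is_i_good(frames, i, c, n_vars) for c in frames[i])
```

For example, with frames[1] = [[1, 2], [-3]] and frames[2] = [[1], [-3]], both lemmas of frame 1 are 1-good, whereas with frames[2] = [[-3]] the lemma [1, 2] is not.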


2 The algorithms differ in the way they check reaching the fixpoint, but this difference
will be ignored unless otherwise stated.

3.2 Searching for i-good Lemmas


Our conjecture is that there exists, on average, a positive correlation between 
the ratio of i-good lemmas vs the total amount of lemmas computed by IC3/CAR 
during generalization and the efficiency of the algorithm. 

Ensuring that only i-good lemmas are produced is as hard as solving the 
verification problem itself, since this is essentially equivalent to synthesizing an 
inductive invariant which implies P. However, there are two situations in which 
it is easy to identify i-good lemmas, for both IC3 and CAR: 


1. In the propagation procedure, if a lemma c can be successfully pushed from
frame i to frame i + 1, then c is i-good;

2. In the generalize procedure, if the current lemma c at frame i is generalized
to a lemma p ⊆ c such that p ∈ O_{i−1}, then p is (i − 1)-good; additionally,
if we can guide the generalization of c so that it produces p, then p becomes
(i − 1)-good.


Therefore, we do not attempt to compute only i-good lemmas, but rather,
our main idea is to use some (cheap) heuristics to increase the probability of
producing i-good lemmas during the normal execution of IC3 and CAR.

We exploit the above observations to design two heuristics that try to bias
the search for lemmas towards those that are more likely to be i-good, which we
call respectively branching and refer-skipping.


Branching. The branching strategy [26] is an important feature of modern 
CDCL (Conflict-Driven Clause Learning) SAT solvers [7]. Traditional scoring 
schemes for branching such as VSIDS and EVSIDS have been extensively evalu- 
ated in [10]. In CDCL SAT solvers, decision variables are selected according to 
their priority. Whenever a conflict occurs, the priority of each variable in the 
clause is increased. To this end, variables that have recently been involved in 
conflicts are more likely to be selected as decision variables. 

We adopt a similar idea in our branching heuristic for IC3/CAR to bias the 
unsatisfiable cores produced by the SAT solver, by ordering the assumptions in 
SAT queries according to their score. This is based on the fact that modern SAT 
solvers based on CDCL apply the assumption literals in the order given by the 
user, and (as a consequence of how CDCL works) the unsatisfiable core produced 
when the formula is unsatisfiable depends on such order, with literals occurring 
earlier in the assumption list being more likely to be included in the core. For
example, assume the SAT query is is_SAT(¬1 ∧ (2 ∨ ¬3), 1 ∧ ¬2 ∧ 3), which is
unsatisfiable; then the UC returned by the SAT solver, e.g., Minisat [5,18], will
be {1}. If the order of the assumptions is changed to 3 ∧ ¬2 ∧ 1, then the UC will be
{3, ¬2}.
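
The order dependence in this example can be reproduced with a toy model of assumption handling. The sketch below is not Minisat and is not a complete SAT procedure: it only enqueues the assumptions one by one, runs unit propagation, and records for every implied literal the set of assumptions it depends on. It assumes clauses and assumptions are given as integer literals; the function name unsat_core is hypothetical, and a result of None only means that propagation alone found no conflict.

```python
def unsat_core(clauses, assumptions):
    """Return the assumptions involved in the first conflict, or None."""
    deps = {}  # true literal -> frozenset of assumptions it depends on

    def propagate():
        changed = True
        while changed:
            changed = False
            for clause in clauses:
                if any(l in deps for l in clause):
                    continue  # clause already satisfied
                open_lits = [l for l in clause if -l not in deps]
                why = frozenset().union(*(deps[-l] for l in clause if -l in deps))
                if not open_lits:
                    return why  # all literals false: conflict
                if len(open_lits) == 1:
                    deps[open_lits[0]] = why  # unit clause: imply the literal
                    changed = True
        return None

    conflict = propagate()
    if conflict is not None:
        return set(conflict)
    for a in assumptions:  # assumptions are applied in the given order
        if -a in deps:
            return {a} | set(deps[-a])
        if a not in deps:
            deps[a] = frozenset([a])
        conflict = propagate()
        if conflict is not None:
            return set(conflict)
    return None
```

With the clauses [[-1], [2, -3]], the assumption order [1, -2, 3] yields the core {1}, while the order [3, -2, 1] yields {3, -2}, matching the example in the text.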

Since UCs are the source for lemmas in both IC3 and CAR, the first idea of
our branching heuristic is to sort the assumption literals in SAT queries
according to how often they occur in recent i-good lemmas. Concretely, this is
implemented as follows:


— We introduce a mapping S[v] : v → score_v, v ∈ X, from each variable to its
score (priority). Initially, all variables have the same score of 0.

— Before each SAT query in which a (negated) lemma c (or its next-state version
c′) is part of the assumptions, c is sorted in descending order of S[var(l)], where
l ∈ c, to give higher priority to assumption literals with higher scores. This
corresponds to the calls to the function SORT(c) in the pseudo-code description
of the main components of IC3 and CAR: at the beginning of UNSAFECHECK
(Algorithm 1 and 2), in GET_PREDECESSOR (line 6 of Algorithm 4, line 6 of
Algorithm 5), and in GENERALIZATION (line 25 of Algorithm 4, line 23 of
Algorithm 5).

— Whenever IC3 or CAR discovers an i-good lemma c, all the variables in c are
rewarded by increasing their score. A lemma c is determined to be i-good
either when it is propagated forward from frame i to frame i + 1 (function
PROPAGATION of Algorithm 4 and 5) or when c is the result of a generalization
from d ⊇ c at frame i + 1 such that c is already in frame i (function
GENERALIZE, Algorithm 3). In the pseudo-code, the reward steps correspond
to the calls to the function REWARD(c) at line 12 of Algorithm 3, line 42 of
Algorithm 4, and line 37 of Algorithm 5. The REWARD function first decays
the scores of all the variables in S[v] by a small amount (we multiply by 0.99
in our implementation), and then increments the score of all the variables in
c (by 1 in our implementation).

— In order to determine whether GENERALIZE produced an i-good lemma, we
also use the function GET_PARENTNODE(c) (line 3 of Algorithm 3), which
returns a cube p in frame i − 1 such that p ⊆ c when c belongs to frame i. (If
multiple such p exist, the one with the highest score is returned.)

— When performing inductive generalization of a lemma c at frame i (Algorithm 3),
in which c is strengthened by trying to drop literals from it as long
as the result is still a valid lemma for frame i, the literals of c are sorted in
increasing order of S[var(l)], with l ∈ c. This corresponds to the call to the
function REVERSE_SORT(c) at line 2 of Algorithm 3 in the pseudo-code.
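
The bookkeeping described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code of IC3Ref, SimpleCAR or nuXmv: literals are integers (negative for negation), the class name BranchingScores and its method names are hypothetical, and only the decay factor 0.99 and the increment 1 are taken from the text.

```python
class BranchingScores:
    """Per-variable scores used to order assumption literals."""

    def __init__(self, decay=0.99, bump=1.0):
        self.decay = decay
        self.bump = bump
        self.score = {}  # variable -> score; missing variables count as 0

    def reward(self, lemma):
        """Called for an i-good lemma: decay all scores, then bump its variables."""
        for var in self.score:
            self.score[var] *= self.decay
        for lit in lemma:
            var = abs(lit)
            self.score[var] = self.score.get(var, 0.0) + self.bump

    def sort(self, literals):
        """Descending score: used before SAT queries (role of SORT)."""
        return sorted(literals, key=lambda l: -self.score.get(abs(l), 0.0))

    def reverse_sort(self, literals):
        """Increasing score: used before dropping literals (role of REVERSE_SORT)."""
        return sorted(literals, key=lambda l: self.score.get(abs(l), 0.0))
```

After rewarding the lemmas [1, -3] and then [-3], variable 3 has score 1.99 and variable 1 has score 0.99, so sort([1, 2, -3]) gives [-3, 1, 2] and reverse_sort([1, 2, -3]) gives [2, 1, -3]. Literals with a low score are therefore tried first when dropping literals, and literals with a high score come first in the assumption list.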


Algorithm 3. Lemma Generalization of IC3/CAR

 1: function GENERALIZE(c, i, rec_lvl = 1)
 2:   REVERSE_SORT(c) // sort literals in c in increasing order of priority
 3:   ¬p := GET_PARENTNODE(¬c) // ¬p ∈ F_{i−1} (O_{i−1}) and p ⊆ c
 4:   req := p // skip literals in p
 5:   for each l ∈ c and l ∉ req do
 6:     cm := c \ {l}
 7:     if DOWN(cm, i, rec_lvl, req) then // CTG-based dropping, see Algorithm 4 and 5
 8:       c := cm
 9:     else
10:       req := req ∪ {l} // failed to drop l
11:   if c \ p = ∅ then // whether c is a good lemma
12:     REWARD(c) // raise priority of variables in c
13:   return c


Skipping Literals by Reference. Lemma generalization is a crucial process in
IC3/CAR that affects performance significantly. Given the original lemma c


Algorithm 4. Auxiliary functions for IC3

 1: function GET_PREDECESSOR(s, i) // generalization of predecessors
 2:   ASSERT(is_SAT(O_i ∧ ¬s ∧ T, s′)) // precondition: ∃t such that (t, s) ∈ T
 3:   μ := get_model()
 4:   in := {l ∈ μ | var(l) ∈ Y}
 5:   t := {l ∈ μ | var(l) ∈ X}
 6:   SORT(t) // sort literals in s in descending order of priority
 7:   while not is_SAT(O_i ∧ in ∧ ¬s′, t) do
 8:     if t = get_UC() then
 9:       break
10:     t := get_UC()
11:   return t
12:
13: function DOWN(c, i, rec_lvl, req) // CTG-based dropping literals
14:   cex_num := 0
15:   while true do
16:     if is_SAT(I ∧ c) then
17:       return false
18:     if not is_SAT(O_i ∧ ¬c ∧ T, c′) then
19:       c := {l | l′ ∈ get_UC()}
20:       return true
21:     else if rec_lvl > MAX_REC_LVL then // MAX_REC_LVL = 3
22:       return false
23:     else
24:       cex := GET_PREDECESSOR(c, i) // cex as a counter-example of generalization
25:       SORT(cex) // sort literals in s in descending order of priority
26:       if cex_num < MAX_CEX_NUM and i > 0 and not is_SAT(O_{i−1} ∧ ¬cex ∧ T, cex′)
            and not is_SAT(I ∧ cex) then // MAX_CEX_NUM = 3
27:         c_cex := GENERALIZE({l | l′ ∈ get_UC()}, i − 1, rec_lvl + 1)
28:         O_k := O_k ∧ ¬c_cex, 1 ≤ k ≤ i − 1
29:         cex_num++
30:       else
31:         cex_num := 0
32:         if (c \ cex) ∩ req ≠ ∅ then
33:           return false
34:         c := c ∩ cex
35:
36: function PROPAGATION(k)
37:   i := 1
38:   for i < k do
39:     for ¬c ∈ O_i do
40:       if not is_SAT(O_i ∧ ¬c ∧ T, c′) then
41:         O_{i+1} := O_{i+1} ∧ ¬c
42:         REWARD(c) // raise priority of variables in c


to be added into frame i (i > 0), the GENERALIZE procedure tries to compute a
new lemma g such that g ⊆ c and g is also valid to be added to frame i (O_i).
The main idea of generalization is to try to drop literals in the original lemma
one by one, to see whether the remaining part can still be a valid lemma.

There are several generalization algorithms with different trade-offs between 
efficiency (in terms of the number of SAT queries) and effectiveness (in terms 
of the potential reduction in the size of the generalized lemma), e.g. [11,17,20]. 
More generally, there might be multiple different ways in which a lemma c can
be generalized, with results of incomparable strength (i.e. there might be both
g1 ⊆ c and g2 ⊆ c such that g1 ⊈ g2 and g2 ⊈ g1).

The main idea of the refer-skipping heuristic is to bias the generalization to
increase the likelihood that the result g is an (i − 1)-good lemma. Consider the
generalization of lemma c = ¬1 ∨ 2 ∨ ¬3 at frame i (i > 1). If there is already a


298 Y. Xia et al. 


Algorithm 5. Auxiliary functions for CAR

 1: function GET_PREDECESSOR(s, i) // generalization of predecessors
 2:   ASSERT(is_SAT(O_i ∧ T, s′)) // precondition: ∃t such that (t, s) ∈ T
 3:   μ := get_model()
 4:   in := {l ∈ μ | var(l) ∈ Y}
 5:   t := {l ∈ μ | var(l) ∈ X}
 6:   SORT(t) // sort literals in s in descending order of priority
 7:   while not is_SAT(O_i ∧ in ∧ ¬s′, t) do
 8:     if t = get_UC() then
 9:       break
10:     t := get_UC()
11:   return t
12:
13: function DOWN(c, i, rec_lvl) // CTG-based dropping literals
14:   cex_num := 0
15:   while true do
16:     if not is_SAT(O_i ∧ T, c′) then
17:       c := {l | l′ ∈ get_UC()}
18:       return true
19:     else if rec_lvl > MAX_REC_LVL then // MAX_REC_LVL = 3
20:       return false
21:     else
22:       cex := GET_PREDECESSOR(c, i)
23:       SORT(cex) // sort literals in s in descending order of priority
24:       if cex_num < MAX_CEX_NUM and i > 0
            and not is_SAT(O_{i−1} ∧ T, cex′) then // MAX_CEX_NUM = 3
25:         c_cex := GENERALIZE({l | l′ ∈ get_UC()}, i − 1, rec_lvl + 1)
26:         O_{i−1} := O_{i−1} ∧ ¬c_cex
27:         cex_num++
28:       else
29:         return false
30:
31: function PROPAGATION(k)
32:   i := 1
33:   for i < k do
34:     for ¬c ∈ O_i do
35:       if not is_SAT(O_i ∧ T, c′) then
36:         O_{i+1} := O_{i+1} ∧ ¬c
37:         REWARD(c) // raise priority of variables in c


lemma g = ¬1 ∨ ¬3 at frame i − 1, we say that g is a candidate (i − 1)-good lemma
for the generalization of c. In order to drive the generalization of c towards g, we
blacklist the literals of g, so that GENERALIZE will never attempt to drop them
from c. As such, we call g a reference for skipping generalization. In general,
there might be multiple references for a given lemma. Currently, our strategy in
refer-skipping is to just pick the first one found.

The implementation of refer-skipping is based on existing generalization algorithms
and only needs fewer than 10 additional lines in the pseudo-code (see lines 4-10
of Algorithm 3). As shown in the algorithm, a variable set req is maintained to
store variables that fail to be dropped, so that no attempt is made to remove them
again later. In order to use refer-skipping, we simply initialize req with the variables
occurring in the candidate (i − 1)-good lemma that is returned by the
GET_PARENTNODE procedure (line 3 of Algorithm 3).
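
The effect of initializing req with the reference can be seen in a simplified Python sketch of the literal-dropping loop. It is an illustration under stated assumptions rather than the code of Algorithm 3: the CTG-based DOWN procedure is replaced by a hypothetical oracle still_valid(candidate) that says whether a candidate lemma is still valid for the frame, literals are integers, and the function name generalize_with_reference is made up for this example.

```python
def generalize_with_reference(lemma, reference, still_valid):
    """Drop literals from `lemma`, never touching the literals of `reference`.

    Returns the generalized lemma and a flag telling whether every remaining
    literal belongs to the reference (the check c \\ p = empty set).
    """
    req = set(reference)          # literals that must not be dropped
    current = list(lemma)
    for lit in list(lemma):
        if lit in req:
            continue              # refer-skipping: no query is spent on this literal
        candidate = [l for l in current if l != lit]
        if still_valid(candidate):
            current = candidate   # the literal was dropped successfully
        else:
            req.add(lit)          # failed to drop; do not try again
    reached_reference = set(current) <= set(reference)
    return current, reached_reference
```

For the lemma ¬1 ∨ 2 ∨ ¬3 with reference ¬1 ∨ ¬3, encoded as [-1, 2, -3] and [-1, -3], only the literal 2 is tried, so a single oracle call suffices and the result coincides with the reference. With an empty reference, the same loop issues three oracle calls.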

Finally, note that although in our pseudo-code (and in our implementation) 
we use the CTG algorithm of [20], the idea discussed here can be applied also 
to the other variants of generalization just as easily.

4 Related Work


In the field of safety model checking, after the introduction of IC3 [11], several 
variants have been presented: [20] presents the counterexample-guided general- 
ization (CTG) of a lemma by blocking states that interfere with it, which sig- 
nificantly improves the performance of IC3; AVY [33] introduces the ideas of IC3 
into IMC (Interpolant Model Checking) [25] to induce a better model checking 
algorithm; its upgrade version kAVY [32] uses k-induction to guide the interpola- 
tion and IC3/PDR generalization inside; [28] proposes to combine IC3/PDR with 
reverse IC3/PDR; the subsequent work [29] interleaves a forward and a back- 
ward execution of IC3 and strengthens one frame sequence by leveraging the 
proof-obligations from the other; IC3-INN [15] enables IC3 to leverage the inter- 
nal signal information of the system to induce a variant of IC3 that can perform 
better on certain industrial benchmarks; [30] introduces under-approximation in 
PDR to improve the performance of bug-finding. 

The importance of discovering inductive lemmas for improving convergence was
first noted in [17]. In PDR terminology, inductive lemmas are the ones belonging
to frame O_∞, as they represent an over-approximation of all the reachable states.

The most relevant related work is [21], where a variant of IC3 named QUIP
is proposed for implementing the pushing of the discovered lemmas to O_∞. In
essence, QUIP adds the negation of a discovered lemma c as a may-proof-obligation,
hence trying to push c to the next frame. Counterexamples of may-proof-obligations
represent an under-approximation of the reachable states and
are stored to disprove the inductiveness of other lemmas. In QUIP terminology,
such lemmas are classified as bad lemmas, as they have no chance of being part
of the inductive invariant. Since the pushing is not limited to the current number
of frames, inductive lemmas are discovered when all the clauses of a frame can
be pushed (O_k \ O_{k+1} = ∅ for a level k), and then added to O_∞. In QUIP
terminology, lemmas belonging to O_∞ are classified as good lemmas, and are
always kept during the algorithm. Observe that the concept of good lemma in
[21] is a stronger version of Definition 1, which instead is local to a frame i and
characterizes lemmas that can be propagated one frame ahead.

Both QUIP and our heuristic try to accomplish a similar task, which is prior- 
itizing the use of already discovered lemmas during the generalization. There 
are however several differences: QUIP proceeds by adding additional proof- 
obligations to the queue and by progressively proving the inductiveness of a 
lemma relative to any frame. Our approach, on the other hand, is based on a 
cheap heuristic strategy that locally guides the generalization prioritizing the 
locally good lemmas. Some i-good lemmas computed may not be part of the 
final invariant and can not be pushed later; in QUIP, such lemmas would not be 
considered good. In our view, pushing them is not necessarily a waste of effort, 
because they still strengthen the frames and their presence might be necessary 
to deduce the final invariant. Finally, it is worth mentioning that our heuristic 
is much simpler to implement and integrate into different PDR-based engines. 

The idea of ordering literals when performing inductive generalization was
already proposed in [11] and adopted, as a default strategy, in several
implementations of IC3 [3,17,19], yielding modest improvements on HWMCC benchmarks,
however without clear trends identified (see [17,19]). Compared to such
works, our approach has two main differences. First, these heuristics favor literals
occurring more frequently in all previous frames, whereas our approach is driven
by the role of lemmas and prefers the variables occurring only in those that are
i-good. Second, our use of ordering heuristics is more pervasive: unlike in previous
works, where variable ordering heuristics are only used during lemma generalization,
we use ordering everywhere the SAT results affect the search direction,
which makes it more effective at biasing the search.


5 Evaluation 


5.1 Experimental Setup 


We integrated the branching and refer-skipping heuristics into three systems: the 
IC3Ref [3] and SimpleCAR [6] (open-source) model checkers, which implement 
the IC3 and (Forward and Backward) CAR algorithms respectively, and the 
mature, state-of-the-art implementation of IC3 available inside the nuXmv model 
checker [12]. We make our implementations and data for reproducing the exper- 
iments available at https://github.com/youyusama/i-Good_Lemmas_MC. 

Since our approach is related to QUIP [21], we include the evaluation of
QUIP, and IC3 (mainly as the baseline for QUIP), as implemented⁴ in IIMC [4].
We also consider the PDR implementation in the ABC model checker [1], which
is state-of-the-art in hardware model checking.


Table 3. Tools and algorithms evaluated in the experiments.

Tools            Algorithms            Available Flags
IC3Ref [3]       IC3 (ic3)             -br | -rs | -sh
SimpleCAR [6]    Forward CAR (fcar)    -br | -rs | -sh
nuXmv [12]       IC3 (nuXmv)           -br | -rs | -sh
IIMC [4]         QUIP (iimc-quip)      -
IIMC [4]         IC3 (iimc-ic3)        -
ABC [1]          PDR (abc-pdr)         -


Table 3 summarizes the tested tools, algorithms, and their flags. We use the
flag “-br” to enable the branching heuristic and “-rs” to enable refer-skipping.
Furthermore, we also evaluate another configuration (denoted as “-sh”), in which
the calls to SORT() functions in Algorithms 4 and 5 are replaced by random
shuffles, thus simulating a strategy that orders variables randomly. When no flag
is active, IC3Ref runs the instances with its own strategy of sorting variables,
present in the original implementation.


3 Although there is an implementation of Backward CAR in SimpleCAR, this methodology
corresponds to reverse IC3. As a result, we did not include Backward CAR in
this paper and leave its evaluation to future work.

4 As far as we know, this is the only publicly available QUIP implementation.

We evaluate all the tools on 749 benchmarks, in Aiger format, of the SINGLE
safety property track of the 2015 and 2017 editions of HWMCC [8]⁵. We ran the
experiments on a cluster, which consists of 2304 2.5 GHz CPUs in 240 nodes
running RedHat 4.8.5 with a total of 96 GB RAM. For each test, we set the
memory limit to 8 GB and the time limit to 5 h. During the experiments, each
model-checking run has exclusive access to a dedicated node.

To increase our confidence in the correctness of the results, we compare the 
results of the solvers to make sure they are all consistent (modulo timeouts). 
For the cases with unsafe results, we also check the provided counterexample 
with the aigsim tool from the Aiger package [2]. We have no discrepancies in the 
results, and all unsafe cases successfully pass the aigsim check. 


5.2 Experimental Results 


Overview. The results of the experimental evaluation are discussed below. We
first consider the aggregated results, as reported in Table 4. For each tool, we
group the results obtained with the various configurations; we report the total
number of benchmarks solved, distinguishing between safe and unsafe benchmarks;
we also report the benchmarks gained and lost by the configurations with
branching and/or refer-skipping active, relative to the baseline where branching
and refer-skipping are not active. We can draw the following conclusions.


— The proposed heuristics are in general effective in improving performance. 
Each of the model checkers, with at least one of branching and refer-skipping 
active, consistently outperforms the respective baseline in terms of the number 
of benchmarks solved. 

— The same holds within the safe instances, with the exception of refer-skipping 
in nuXmv that solves two safe benchmarks less than the baseline. 

— The heuristics also yield a uniform improvement over the baseline in the 
unsafe instances. 

— The combination of branching and refer-skipping gives further improvements 
over a single technique, with the exception of nuXmv with branching, which 
cumulatively solves 5 more benchmarks than nuXmv with branching and refer- 
skipping. 

— The gain is not uniform across the instances. For example, nuXmv with branch- 
ing gains 52 benchmarks (44 safe and 8 unsafe) that are not solved by nuXmv 
baseline, while losing 13 (safe) benchmarks. This level of variability can be 
expected, given a heuristic approach, but further investigation is needed to 
assess the underlying phenomena. 


— The performance of using the heuristics guided by random variable ordering
does not differ significantly from the baseline in terms of aggregate results.
There are some differences (as expected) at the level of individual instances,
especially for CAR, but no clear trend emerges overall.

— The comparison also shows that the considered systems compare well against
the state-of-the-art system ABC, and QUIP; QUIP turns out to be quite inefficient
and is disregarded in the following. Note that the original implementation
of QUIP is not available; the fact that the available version of QUIP
implemented on top of IIMC does not seem to achieve the same improvements
reported in the original paper [21] (the code for which is unfortunately not
available) suggests that QUIP is far from trivial to implement. For
reference, QUIP performs even worse than the IC3 implementation in IIMC,
whose performance is similar to the IC3Ref baseline, see Table 4.


5 Since HWMCC 2019, the official format used in the competition has switched from
Aiger to Btor2 [27], a format for word-level model checking. As a result, we did not
include those instances in our experiments.


Table 4. Summary of overall results among different configurations.

Configuration    #Solved  #Safe  #Unsafe  Gained(safe/unsafe)  Lost(safe/unsafe)
ic3 -br -rs      439      313    126      25(18/7)             6(4/2)
ic3 -br          428      302    126      22(15/7)             14(12/2)
ic3 -rs          430      308    122      21(17/4)             11(8/3)
ic3 -sh          417      297    120      9(7/2)               12(9/3)
ic3              420      299    121      -                    -
fcar -br -rs     444      319    125      54(43/11)            1(0/1)
fcar -br         429      308    121      43(33/10)            5(1/4)
fcar -rs         410      295    115      23(22/1)             4(3/1)
fcar -sh         394      277    117      31(22/9)             28(21/7)
fcar             391      276    115      -                    -
nuXmv -br -rs    497      353    144      49(39/10)            15(15/0)
nuXmv -br        502      360    142      52(44/8)             13(13/0)
nuXmv -rs        473      333    140      26(19/7)             16(15/1)
nuXmv -sh        464      327    137      7(4/3)               6(6/0)
nuXmv            463      329    134      -                    -
abc-pdr          430      315    115      -                    -
iimc-ic3         418      307    111      -                    -
iimc-quip        377      281    96       -                    -
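
Since the gained and lost counts are both relative to the baseline of the same tool, each row should satisfy #Solved = #Solved(baseline) + Gained − Lost. The short Python check below applies this identity to the fcar and nuXmv rows, using the #Solved, Gained and Lost values as printed in Table 4; the data layout and the function name inconsistent_rows are only assumptions made for this sketch.

```python
# Baseline #Solved per tool, as listed in Table 4.
BASELINE = {"fcar": 391, "nuXmv": 463}

# (tool, flags, #Solved, Gained, Lost), as listed in Table 4.
ROWS = [
    ("fcar", "-br -rs", 444, 54, 1),
    ("fcar", "-br", 429, 43, 5),
    ("fcar", "-rs", 410, 23, 4),
    ("fcar", "-sh", 394, 31, 28),
    ("nuXmv", "-br -rs", 497, 49, 15),
    ("nuXmv", "-br", 502, 52, 13),
    ("nuXmv", "-rs", 473, 26, 16),
    ("nuXmv", "-sh", 464, 7, 6),
]


def inconsistent_rows(rows, baseline):
    """Return the rows whose #Solved differs from baseline + gained - lost."""
    return [
        (tool, flags)
        for tool, flags, solved, gained, lost in rows
        if baseline[tool] + gained - lost != solved
    ]
```

Calling inconsistent_rows(ROWS, BASELINE) returns an empty list, so these rows are mutually consistent. For instance, nuXmv -br gives 463 + 52 − 13 = 502.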


Similar insights can be obtained from Fig. 1, which clearly shows the positive
effect of the heuristics on performance.


Detailed Statistics. As shown in Table 4 and Fig. 1, nuXmv is highly optimized
and performs much better than the open-source IC3 implementations,
but enabling both heuristics still improves its overall performance
by solving 34 more instances. For IC3Ref and SimpleCAR, the numbers
of solved cases increase by 19 and 53, respectively. Moreover, from Table 4,
nuXmv/IC3Ref/SimpleCAR is able to solve 24/14/43 more safe and 10/5/10 more
unsafe instances with both heuristics.

A comparison of the performance of the tools with and without the heuristics
is shown in Fig. 2. All three solvers are able to reduce their time cost when equipped
with branching and refer-skipping (see the last row of the figure). Specifically,
67.8% of the instances take no more time to check with ‘nuXmv -br -rs’ than with the
baseline, and the corresponding portions for ‘ic3 -br -rs’ and ‘fcar -br -rs’ are 77.9% and 87.0%.
More variability occurs when considering only a single heuristic, which needs to
be explored in the future. For example, ‘fcar -br’ and ‘nuXmv -rs’ generally take
slightly more time than ‘fcar’ and ‘nuXmv’, respectively.


[Figure 1: plot of #Cases Solved against CPU Time (s), from 0 to 18000 s; legend entries: nuXmv -br -rs, nuXmv, fcar -br -rs, fcar, ic3 -br -rs, ic3, abc-pdr.]


Fig. 1. Comparisons among the implementations of IC3, PDR and CAR under different 
configurations. (To make the figure more readable, we skip the results with a single 
heuristic, which are still shown in Table 4.) 


According to Table 4, either branching or refer-skipping is effective for improving
nuXmv, IC3Ref, and SimpleCAR. For nuXmv and SimpleCAR, branching is
more useful, considering that ‘nuXmv -br’ (resp. ‘fcar -br’) solves 39 (resp. 38)
more instances than ‘nuXmv’ (resp. ‘fcar’), with 31 (resp. 32) safe and 8 (resp.
6) unsafe. For IC3Ref, the improvement with either heuristic seems relatively
modest, i.e., ‘ic3 -br’ solves 8 more instances than ‘ic3’, with 3 safe and 5 unsafe,
while ‘ic3 -rs’ solves 10 more instances than ‘ic3’, with 9 safe and 1 unsafe.

As listed above, ‘ic3 -br -rs’ loses only 6 instances that are solved by ‘ic3’,
while ‘fcar -br -rs’ loses only 1 instance that is solved by ‘fcar’, which
indicates that ‘fcar -br -rs’ nearly dominates ‘fcar’. For ‘nuXmv -br
-rs’, the number of lost cases is 15, which is still modest when compared to the
gain of 49. So enabling branching and refer-skipping together makes the checkers
pay a limited cost. The same applies when the checkers are equipped with only
a single heuristic, see Table 4.


5.3 Why Do branching and refer-skipping Work?


[Figure 2: grid of scatter plots with logarithmic axes from 10 to 10000; the recoverable panel titles are fcar -rs, nuXmv -rs, fcar -br -rs, ic3 -br -rs, and nuXmv -br -rs; legend: safe, unsafe.]

Fig. 2. Time comparison between IC3/CAR with and without two heuristics on safe
and unsafe cases. The baseline is always on the y-axis. Points above the diagonal indicate
better performance with the heuristics active. Points on the borders indicate timeouts
(18000 s).

Fig. 3. Comparison of the success rate (sr) in computing i-good lemmas between
IC3/CAR with and without branching and refer-skipping.

To measure why branching and refer-skipping work, we introduce sr, i.e. the
success rate in computing i-good lemmas. Formally, sr = N_g/N, where N_g is
the number of generalizations that successfully return i-good lemmas, while N
is the total number of generalization calls. We instrumented the two open-source
checkers IC3Ref and SimpleCAR in order to compute sr for each run
(including the runs that reach the timeout without returning a result).
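
As a small illustration of this measure, the Python sketch below computes the success rate from a per-run record of generalization calls. The record format (one boolean per call, True when the call returned an i-good lemma) and the function name success_rate are assumptions made for this example; the text does not describe how the instrumented checkers log this information.

```python
def success_rate(outcomes):
    """Fraction of generalization calls that returned an i-good lemma.

    `outcomes` holds one boolean per generalization call of a single run.
    A run without any generalization call is given the rate 0.0 by convention.
    """
    outcomes = list(outcomes)
    if not outcomes:
        return 0.0
    good_calls = sum(1 for good in outcomes if good)
    return good_calls / len(outcomes)
```

For a run with four generalization calls of which three returned an i-good lemma, success_rate([True, False, True, True]) evaluates to 0.75.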


— Consider the results presented in Fig. 3. The figure shows the comparison
of the success rate in computing i-good lemmas between IC3/CAR with and
without the heuristics. ‘ic3 -br -rs’ computes more i-good lemmas than ‘ic3’
on 54% of the tested instances, while ‘fcar -br -rs’ computes more i-good lemmas
than ‘fcar’ on 67% of the tested instances, an even higher portion.
This supports the conjecture that enabling branching and refer-skipping makes
IC3/CAR compute more i-good lemmas.

— Now consider Fig. 4. The figure shows the comparison between the deviation
of the success rate in computing i-good lemmas (Y axis) and the deviation of checking
(CPU) time (X axis) for IC3/CAR with and without the heuristics. The
meaning of each point in the plot is explained in the caption of the figure. In
general, the more points are located in the first quadrant, the better our claim
is supported.

Clearly, the plots for both IC3 and CAR in Fig. 4 support the conjecture
that searching for more i-good lemmas can help achieve better model-checking
performance (time cost).

Fig. 4. Comparison between the deviation of the success rate (sr) in computing i-good
lemmas (Y axis) and the deviation of checking (CPU) time (X axis) for IC3/CAR with
and without the heuristics. For each instance, let the checking time of ‘ic3’/‘fcar’ be
t and that of ‘ic3 -br -rs’/‘fcar -br -rs’ be t′, and let sr and sr′ be the corresponding
success rates. Each point has t − t′ as the x value and sr′ − sr as the y value.
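
The construction of the points in Fig. 4 can be written down directly. The Python sketch below is only an illustration with made-up numbers: each run is assumed to be a tuple (t, t_h, sr, sr_h) holding the baseline time, the time with heuristics, the baseline success rate and the success rate with heuristics, and the function names are hypothetical.

```python
def deviation_point(t, t_h, sr, sr_h):
    """Point of Fig. 4: x is the time saved, y is the gain in success rate."""
    return (t - t_h, sr_h - sr)


def first_quadrant_share(runs):
    """Share of runs that are both faster and have a higher success rate."""
    points = [deviation_point(*run) for run in runs]
    in_first = sum(1 for x, y in points if x > 0 and y > 0)
    return in_first / len(points)
```

With the three hypothetical runs (100.0, 40.0, 0.2, 0.5), (50.0, 80.0, 0.3, 0.4) and (10.0, 5.0, 0.6, 0.5), only the first one falls in the first quadrant, so the share is one third.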


Finally, we argue that computing as many i-good lemmas as possible is the
direction to take to improve the performance of IC3 and its variants. branching
and refer-skipping are two heuristics that can enable IC3/CAR to compute more
i-good lemmas. However, there can be more efficient ways to compute i-good
lemmas, which is left for our future work.

6 Conclusions and Future Work


In this paper, we proposed a heuristic-based approach to improve the performance
of IC3-based safety model checking. The idea is to steer the search of the
over-approximation sequence towards i-good lemmas, i.e. lemmas that can be
pushed from frame i to frame i + 1. On the one side, we attempt to control the
way the SAT solver extracts the unsat cores, by privileging variables occurring
in i-good lemmas (branching); on the other, we control lemma generalization
by avoiding dropping literals that occur in a subsuming lemma in the previous
layer (refer-skipping). The approach is very simple to implement and has
been integrated into two open-source model checkers and an industrial-strength,
closed-source model checker. The experimental evaluation, carried out on a wide
set of benchmarks, shows that the approach yields computational benefits on all
the implementations. Further analysis shows a correlation between i-good lemmas
and performance improvements and suggests that the proposed heuristics
are effective in finding more i-good lemmas.

In the future, we plan to investigate the reasons for performance improvement/degradation
at the level of the single benchmarks. We will also attempt
to integrate the proposed ideas with the ideas in QUIP, explore different kinds
of heuristics, and lift this approach to the safety checking of infinite-state systems
[13, 14].


Abstract. We introduce Hyper²LTL, a temporal logic for the specification
of hyperproperties that allows for second-order quantification
over sets of traces. Unlike first-order temporal logics for hyperproperties,
such as HyperLTL, Hyper²LTL can express complex epistemic properties
like common knowledge, Mazurkiewicz trace theory, and asynchronous
hyperproperties. The model checking problem of Hyper²LTL
is, in general, undecidable. For the expressive fragment where second-order
quantification is restricted to smallest and largest sets, we present
an approximate model-checking algorithm that computes increasingly
precise under- and overapproximations of the quantified sets, based on
fixpoint iteration and automata learning. We report on encouraging
experimental results with our model-checking algorithm, which we implemented
in the tool HySO.


1 Introduction 


About a decade ago, Clarkson and Schneider coined the term hyperproperties [21] 
for the rich class of system requirements that relate multiple computations. In 
their definition, hyperproperties generalize trace properties, which are sets of 
traces, to sets of sets of traces. This covers a wide range of requirements, from 
information-flow security policies to epistemic properties describing the knowl- 
edge of agents in a distributed system. Missing from Clarkson and Schneider’s 
original theory was, however, a concrete specification language that could express 
customized hyperproperties for specific applications and serve as the common 
semantic foundation for different verification methods. 

A first milestone towards such a language was the introduction of the temporal
logic HyperLTL [20]. HyperLTL extends linear-time temporal logic (LTL)
with quantification over traces. Suppose, for example, that an agent i in a distributed
system observes only a subset of the system variables. The agent knows
that some LTL formula φ is true on some trace π iff φ holds on all traces π′
that agent i cannot distinguish from π. If we denote the indistinguishability of
π and π′ by π ∼_i π′, then the property that there exists a trace π where agent i
knows φ can be expressed as the HyperLTL formula

    ∃π. ∀π′. π ∼_i π′ → φ(π′),

where we write φ(π′) to denote that the trace property φ holds on trace π′.

While HyperLTL and its variations have found many applications [28, 32, 44],
the expressiveness of these logics is limited, leaving many widely used
hyperproperties out of reach. A prominent example is common knowledge, which is
used in distributed applications to ensure simultaneous action [30, 40]. Common
knowledge in a group of agents means that the agents not only know individually
that some condition $\varphi$ is true, but that this knowledge is "common" to
the group in the sense that each agent knows that every agent knows that
$\varphi$ is true; on top of that, each agent in the group knows that every
agent knows that every agent knows that $\varphi$ is true; and so on, forming an
infinite chain of knowledge.

The fundamental limitation of HyperLTL that makes it impossible to express
properties like common knowledge is that the logic is restricted to first-order
quantification. HyperLTL, then, cannot reason about sets of traces directly, but
must always do so by referring to individual traces that are chosen
existentially or universally from the full set of traces. For the specification
of an agent's individual knowledge, where we are only interested in the
(non-)existence of a single trace that is indistinguishable and that violates
$\varphi$, this is sufficient; however, expressing an infinite chain, as needed
for common knowledge, is impossible.

In this paper, we introduce Hyper²LTL, a temporal logic for hyperproperties
with second-order quantification over traces. In Hyper²LTL, the existence of a
trace $\pi$ where the condition $\varphi$ is common knowledge can be expressed
as the following formula (using slightly simplified syntax):

$$\exists \pi.\, \exists X.\; \pi \in X \land \Big( \forall \pi' \in X.\, \forall \pi''.\, \Big( \bigvee_{i=1}^{n} \pi' \sim_i \pi'' \Big) \rightarrow \pi'' \in X \Big) \land \forall \pi' \in X.\, \varphi(\pi').$$


The second-order quantifier $\exists X$ postulates the existence of a set $X$ of
traces that (1) contains $\pi$; that (2) is closed under the observations of
each agent, i.e., for every trace $\pi'$ already in $X$, all other traces
$\pi''$ that some agent $i$ cannot distinguish from $\pi'$ are also in $X$; and
that (3) only contains traces that satisfy $\varphi$. The existence of $X$ is a
necessary and sufficient condition for $\varphi$ being common knowledge on
$\pi$. In the paper, we show that Hyper²LTL is an elegant specification language
for many hyperproperties of interest that cannot be expressed in HyperLTL,
including, in addition to epistemic properties like common knowledge, also
Mazurkiewicz trace theory and asynchronous hyperproperties.

The model-checking problem for Hyper²LTL is much more difficult than for
HyperLTL. A HyperLTL formula can be checked by translating the LTL subformula
into an automaton and then applying a series of automata transformations, such
as self-composition to generate multiple traces, projection for existential
quantification, and complementation for negation [8, 32]. For Hyper²LTL, the
model-checking problem is, in general, undecidable. We introduce a method that
nevertheless obtains sound results by over- and underapproximating the
quantified sets of traces. For this purpose, we study Hyper²LTL_fp, a fragment
of Hyper²LTL, in which we restrict second-order quantification to the smallest
or largest set satisfying some property. For example, to check common knowledge,
it suffices to consider the smallest set $X$ that is closed under the
observations of all agents. This smallest set $X$ is defined by the (monotone)
fixpoint operation that adds, in each step, all traces that are
indistinguishable from some trace already in $X$.

We develop an approximate model-checking algorithm for Hyper²LTL_fp that
uses bidirectional inference to deduce lower and upper bounds on second-order 
variables, interposed with first-order model checking in the style of HyperLTL. 
Our procedure is parametric in an oracle that provides (increasingly precise) 
lower and upper bounds. In the paper, we realize the oracles with fixpoint itera- 
tion for underapproximations of the sets of traces assigned to the second-order 
variables, and automata learning for overapproximations. We report on encour- 
aging experimental results with our model-checking algorithm, which has been 
implemented in a tool called HySO. 


2 Preliminaries 


For $n \in \mathbb{N}$ we define $[n] := \{1, \ldots, n\}$. We assume that $AP$
is a finite set of atomic propositions and define $\Sigma := 2^{AP}$. For
$t \in \Sigma^\omega$ and $i \in \mathbb{N}$ define $t(i) \in \Sigma$ as the
$i$th element in $t$ (starting with the 0th); and $t[i, \infty]$ for the
infinite suffix starting at position $i$. For traces
$t_1, \ldots, t_n \in \Sigma^\omega$ we write
$zip(t_1, \ldots, t_n) \in (\Sigma^n)^\omega$ for the pointwise zipping of the
traces, i.e., $zip(t_1, \ldots, t_n)(i) := (t_1(i), \ldots, t_n(i))$.
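
As a small illustration of the zipping operation, the following Python sketch
zips finite prefixes of traces. Representing a letter of $\Sigma = 2^{AP}$ as a
`frozenset` of proposition names, and working on finite prefixes rather than
infinite traces, are assumptions made only for this example; the name
`zip_traces` is hypothetical.

```python
from typing import FrozenSet, List, Tuple

Letter = FrozenSet[str]   # one element of Sigma = 2^AP
Prefix = List[Letter]     # finite prefix of an infinite trace (assumption)


def zip_traces(*prefixes: Prefix) -> List[Tuple[Letter, ...]]:
    """Pointwise zipping: position i of the result is (t_1(i), ..., t_n(i))."""
    length = min(len(p) for p in prefixes)  # only the common prefix is defined
    return [tuple(p[i] for p in prefixes) for i in range(length)]


t1 = [frozenset({"a"}), frozenset({"a"}), frozenset({"d"})]
t2 = [frozenset({"a"}), frozenset({"b"}), frozenset({"d"})]
zipped = zip_traces(t1, t2)  # a finite word over Sigma^2
```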


Transition Systems. A transition system is a tuple
$\mathcal{T} = (S, S_0, \kappa, L)$ where $S$ is a set of states,
$S_0 \subseteq S$ is a set of initial states, $\kappa \subseteq S \times S$ is a
transition relation, and $L : S \to \Sigma$ is a labeling function. A path in
$\mathcal{T}$ is an infinite state sequence $s_0 s_1 s_2 \cdots \in S^\omega$,
s.t., $s_0 \in S_0$, and $(s_i, s_{i+1}) \in \kappa$ for all $i$. The associated
trace is given by $L(s_0) L(s_1) L(s_2) \cdots \in \Sigma^\omega$ and
$Traces(\mathcal{T}) \subseteq \Sigma^\omega$ denotes all traces of $\mathcal{T}$.
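
The definition can be made concrete with a short Python sketch that enumerates
finite prefixes of the traces of a small system. The two-state system below is
hypothetical, and the sketch assumes that every state has at least one
successor, so that every finite path prefix extends to an infinite path.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

State = str
Letter = FrozenSet[str]


def trace_prefixes(init: Set[State],
                   trans: Set[Tuple[State, State]],
                   label: Dict[State, Letter],
                   length: int) -> Set[Tuple[Letter, ...]]:
    """Label sequences of all paths of `length` states starting in `init`."""
    paths: List[Tuple[State, ...]] = [(s,) for s in init]
    for _ in range(length - 1):
        # Extend every path by every transition leaving its last state.
        paths = [p + (t,) for p in paths for (s, t) in trans if s == p[-1]]
    return {tuple(label[s] for s in p) for p in paths}


# Hypothetical system: s0 (labeled {a}) may loop or move to s1 (labeled {d}).
A, D = frozenset({"a"}), frozenset({"d"})
init = {"s0"}
trans = {("s0", "s0"), ("s0", "s1"), ("s1", "s1")}
label = {"s0": A, "s1": D}
prefixes = trace_prefixes(init, trans, label, 3)
```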


Automata. A non-deterministic Büchi automaton (NBA) [18] is a tuple
$\mathcal{A} = (\Sigma, Q, Q_0, \delta, F)$ where $\Sigma$ is a finite alphabet,
$Q$ is a finite set of states, $Q_0 \subseteq Q$ is the set of initial states,
$F \subseteq Q$ is a set of accepting states, and
$\delta : Q \times \Sigma \to 2^Q$ is the transition function. A run on a word
$u \in \Sigma^\omega$ is an infinite sequence of states
$q_0 q_1 q_2 \cdots \in Q^\omega$ such that $q_0 \in Q_0$ and for every
$i \in \mathbb{N}$, $q_{i+1} \in \delta(q_i, u(i))$. The run is accepting if it
visits states in $F$ infinitely many times, and we define the language of
$\mathcal{A}$, denoted $\mathcal{L}(\mathcal{A}) \subseteq \Sigma^\omega$, as all
infinite words on which $\mathcal{A}$ has an accepting run.
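
To illustrate the Büchi acceptance condition, the sketch below decides whether
an NBA accepts an ultimately periodic word $u \cdot v^\omega$. Restricting to
such lasso-shaped words, the dictionary encoding of $\delta$, and the function
name `accepts_lasso` are assumptions of this example. The word is accepted iff
some node (state, position) with an accepting state is reachable and lies on a
cycle of the finite graph obtained by unrolling the automaton along the lasso.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Letter = FrozenSet[str]
Node = Tuple[str, int]  # (automaton state, position in the lasso)


def accepts_lasso(init: Set[str],
                  delta: Dict[Tuple[str, Letter], Set[str]],
                  accepting: Set[str],
                  u: List[Letter], v: List[Letter]) -> bool:
    """Check whether the NBA has an accepting run on u . v^omega (v non-empty)."""
    word = u + v

    def succ(node: Node) -> Set[Node]:
        q, pos = node
        nxt = pos + 1 if pos + 1 < len(word) else len(u)  # wrap into the loop
        return {(q2, nxt) for q2 in delta.get((q, word[pos]), set())}

    def reach(sources: Set[Node]) -> Set[Node]:
        """Nodes reachable from `sources` in at least one step."""
        seen: Set[Node] = set()
        stack = list(sources)
        while stack:
            for m in succ(stack.pop()):
                if m not in seen:
                    seen.add(m)
                    stack.append(m)
        return seen

    start = {(q, 0) for q in init}
    candidates = start | reach(start)
    return any(q in accepting and (q, pos) in reach({(q, pos)})
               for (q, pos) in candidates)


# Hypothetical NBA for "a holds infinitely often".
A, E = frozenset({"a"}), frozenset()
delta = {("q0", A): {"q1"}, ("q1", A): {"q1"},
         ("q0", E): {"q0"}, ("q1", E): {"q0"}}
inf_often = accepts_lasso({"q0"}, delta, {"q1"}, [E], [A, E])
only_once = accepts_lasso({"q0"}, delta, {"q1"}, [A], [E])
```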


HyperLTL. HyperLTL [20] is one of the most studied temporal logics for the
specification of hyperproperties. We assume that $\mathcal{V}$ is a fixed set of
trace variables. For the most part, we use variations of $\pi$ (e.g.,
$\pi, \pi', \pi_1, \ldots$) to denote trace variables. HyperLTL formulas are
then generated by the grammar

$$\varphi := \mathbb{Q} \pi.\, \varphi \mid \psi$$
$$\psi := a_\pi \mid \neg \psi \mid \psi \land \psi \mid \bigcirc \psi \mid \psi\, \mathcal{U}\, \psi$$

where $a \in AP$ is an atomic proposition, $\pi \in \mathcal{V}$ is a trace
variable, $\mathbb{Q} \in \{\forall, \exists\}$ is a quantifier, and $\bigcirc$
and $\mathcal{U}$ are the temporal operators next and until.
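The semantics of HyperLTL is given with respect to a trace assignment $\Pi$,
which is a partial mapping $\Pi : \mathcal{V} \rightharpoonup \Sigma^\omega$
that maps trace variables to traces. Given $\pi \in \mathcal{V}$ and
$t \in \Sigma^\omega$ we define $\Pi[\pi \mapsto t]$ as the updated assignment
that maps $\pi$ to $t$. For $i \in \mathbb{N}$ we define $\Pi[i, \infty]$ as the
trace assignment defined by $\Pi[i, \infty](\pi) := \Pi(\pi)[i, \infty]$, i.e.,
we (synchronously) progress all traces by $i$ steps. For quantifier-free
formulas $\psi$ we follow the LTL semantics and define

$$\Pi \models a_\pi \quad\text{iff}\quad a \in \Pi(\pi)(0)$$
$$\Pi \models \neg \psi \quad\text{iff}\quad \Pi \not\models \psi$$
$$\Pi \models \psi_1 \land \psi_2 \quad\text{iff}\quad \Pi \models \psi_1 \text{ and } \Pi \models \psi_2$$
$$\Pi \models \bigcirc \psi \quad\text{iff}\quad \Pi[1, \infty] \models \psi$$
$$\Pi \models \psi_1\, \mathcal{U}\, \psi_2 \quad\text{iff}\quad \exists i \in \mathbb{N}.\; \Pi[i, \infty] \models \psi_2 \text{ and } \forall j < i.\; \Pi[j, \infty] \models \psi_1.$$

The indexed atomic propositions refer to a specific path in $\Pi$, i.e.,
$a_\pi$ holds iff $a$ holds on the trace bound to $\pi$. Quantifiers range over
system traces:

$$\Pi \models_{\mathcal{T}} \psi \text{ iff } \Pi \models \psi \quad\text{and}\quad \Pi \models_{\mathcal{T}} \mathbb{Q} \pi.\, \varphi \text{ iff } \mathbb{Q} t \in Traces(\mathcal{T}).\; \Pi[\pi \mapsto t] \models_{\mathcal{T}} \varphi.$$

We write $\mathcal{T} \models \varphi$ if $\emptyset \models_{\mathcal{T}} \varphi$
where $\emptyset$ denotes the empty trace assignment.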
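
The quantifier-free part of this semantics can be sketched in Python for
ultimately periodic traces. The encoding of formulas as nested tuples, the
representation of a trace as a pair (prefix, loop), and the name `holds` are
assumptions of this illustration; the until operator is evaluated as a least
fixpoint over the finitely many joint positions of the lassos.

```python
from functools import reduce
from math import gcd


def holds(formula, assignment) -> bool:
    """Evaluate a quantifier-free formula under a non-empty trace assignment.

    assignment: dict trace_variable -> (prefix, loop), both lists of frozensets,
    with a non-empty loop. Formulas: ("ap", a, pi), ("not", f), ("and", f, g),
    ("next", f), ("until", f, g).
    """
    P = max(len(u) for u, _ in assignment.values())
    L = reduce(lambda x, y: x * y // gcd(x, y),
               (len(v) for _, v in assignment.values()), 1)
    positions = range(P + L)  # joint positions; position P + L equals P

    def succ(i):
        return i + 1 if i + 1 < P + L else P

    def letter(pi, i):
        u, v = assignment[pi]
        return u[i] if i < len(u) else v[(i - len(u)) % len(v)]

    def sat(f):
        op = f[0]
        if op == "ap":
            return {i for i in positions if f[1] in letter(f[2], i)}
        if op == "not":
            return set(positions) - sat(f[1])
        if op == "and":
            return sat(f[1]) & sat(f[2])
        if op == "next":
            inner = sat(f[1])
            return {i for i in positions if succ(i) in inner}
        if op == "until":
            left, result = sat(f[1]), set(sat(f[2]))
            changed = True
            while changed:  # least fixpoint of the until unfolding
                new = {i for i in left if succ(i) in result} - result
                changed = bool(new)
                result |= new
            return result
        raise ValueError(f"unknown operator: {op}")

    return 0 in sat(formula)


A, D = frozenset({"a"}), frozenset({"d"})
asg = {"pi": ([A], [D]), "pi2": ([A, A], [D])}   # a d^omega and a a d^omega
until_ok = holds(("until", ("ap", "a", "pi"), ("ap", "d", "pi")), asg)
next_a_pi = holds(("next", ("ap", "a", "pi")), asg)
next_a_pi2 = holds(("next", ("ap", "a", "pi2")), asg)
```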


HyperQPTL. HyperQPTL [45] adds, on top of the trace quantification of HyperLTL,
also propositional quantification (analogous to the propositional
quantification that QPTL [46] adds on top of LTL). For example, HyperQPTL can
express a promptness property which states that there must exist a bound (which
is common among all traces), up to which an event must have happened. We can
express this as
$\exists q.\, \forall \pi.\, \Diamond q \land (\neg q)\, \mathcal{U}\, a_\pi$,
which states that there exists an evaluation of proposition $q$ such that
(1) $q$ holds at least once, and (2) for all traces $\pi$, $a$ holds on $\pi$
before the first occurrence of $q$. See [8] for details.


3 Second-Order HyperLTL 


The (first-order) trace quantification in HyperLTL ranges over the set of all
system traces; we thus cannot reason about arbitrary sets of traces as required
for, e.g., common knowledge. We introduce a second-order extension of HyperLTL
by introducing second-order variables (ranging over sets of traces) and allowing
quantification over traces from any such set. We present two variants of our
logic that differ in the way quantification is resolved. In Hyper²LTL, we
quantify over arbitrary sets of traces. While this yields a powerful and
intuitive logic, second-order quantification is inherently non-constructive.
During model checking, there thus does not exist an efficient way to even
approximate possible witnesses for the sets of traces. To solve this quandary,
we restrict Hyper²LTL to Hyper²LTL_fp, where we instead quantify over sets of
traces that satisfy some minimality or maximality constraint. This allows for
large fragments of Hyper²LTL_fp that admit algorithmic approximations to their
model checking (by, e.g., using known techniques from fixpoint computations
[47, 48]).
3.1 Hyper²LTL


Alongside the set $\mathcal{V}$ of trace variables, we use a set $\mathfrak{V}$
of second-order variables (which we, for the most part, denote with capital
letters $X, Y, \ldots$). We assume that there is a special variable
$\mathfrak{S} \in \mathfrak{V}$ that refers to the set of traces of the given
system at hand, and a variable $\mathfrak{A} \in \mathfrak{V}$ that refers to
the set of all traces. We define the Hyper²LTL syntax by the following grammar:

$$\varphi := \mathbb{Q} \pi \in X.\, \varphi \mid \mathbb{Q} X.\, \varphi \mid \psi$$
$$\psi := a_\pi \mid \neg \psi \mid \psi \land \psi \mid \bigcirc \psi \mid \psi\, \mathcal{U}\, \psi$$

where $a \in AP$ is an atomic proposition, $\pi \in \mathcal{V}$ is a trace
variable, $X \in \mathfrak{V}$ is a second-order variable, and
$\mathbb{Q} \in \{\forall, \exists\}$ is a quantifier. We also consider the
usual derived Boolean constants ($true$, $false$) and connectives
($\lor, \rightarrow, \leftrightarrow$) as well as the temporal operators
eventually ($\Diamond \psi := true\, \mathcal{U}\, \psi$) and globally
($\Box \psi := \neg \Diamond \neg \psi$). Given a set of atomic propositions
$P \subseteq AP$ and two trace variables $\pi, \pi'$, we abbreviate
$\pi =_P \pi' := \bigwedge_{a \in P} (a_\pi \leftrightarrow a_{\pi'})$.


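Semantics. Apart from a trace assignment $\Pi$ (as in the semantics of
HyperLTL), we maintain a second-order assignment
$\Delta : \mathfrak{V} \rightharpoonup 2^{\Sigma^\omega}$ mapping second-order
variables to sets of traces. Given $X \in \mathfrak{V}$ and
$A \subseteq \Sigma^\omega$ we define the updated assignment
$\Delta[X \mapsto A]$ as expected. Quantifier-free formulas $\psi$ are then
evaluated in a fixed trace assignment as for HyperLTL (cf. Sect. 2). For the
quantifier prefix we define:

$$\Pi, \Delta \models \psi \quad\text{iff}\quad \Pi \models \psi$$
$$\Pi, \Delta \models \mathbb{Q} \pi \in X.\, \varphi \quad\text{iff}\quad \mathbb{Q} t \in \Delta(X).\; \Pi[\pi \mapsto t], \Delta \models \varphi$$
$$\Pi, \Delta \models \mathbb{Q} X.\, \varphi \quad\text{iff}\quad \mathbb{Q} A \subseteq \Sigma^\omega.\; \Pi, \Delta[X \mapsto A] \models \varphi$$

Second-order quantification updates $\Delta$ with a set of traces, and
first-order quantification updates $\Pi$ by quantifying over traces within the
set defined by $\Delta$.

Initially, we evaluate a formula in the empty trace assignment and fix the
valuation of the special second-order variable $\mathfrak{S}$ to be the set of
all system traces and $\mathfrak{A}$ to be the set of all traces. That is, given
a system $\mathcal{T}$ and Hyper²LTL formula $\varphi$, we say that
$\mathcal{T}$ satisfies $\varphi$, written $\mathcal{T} \models \varphi$, if
$\emptyset, [\mathfrak{S} \mapsto Traces(\mathcal{T}), \mathfrak{A} \mapsto \Sigma^\omega] \models \varphi$,
where we write $\emptyset$ for the empty trace assignment. The model-checking
problem for Hyper²LTL is checking whether $\mathcal{T} \models \varphi$ holds.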

Hyper²LTL naturally generalizes HyperLTL by adding second-order quantification.
As sets range over arbitrary traces, Hyper²LTL also subsumes the more powerful
logic HyperQPTL. The proof of Lemma 1 is given in the full version of this
paper [11].


Lemma 1. Hyper²LTL subsumes HyperQPTL (and thus also HyperLTL).
Syntactic Sugar. In Hyper²LTL, we can quantify over traces within a
second-order variable, but we cannot state, within the body of the formula, that
some path is a member of some second-order variable. For that, we define
$\pi \triangleright X$ (as an atom within the body) as syntactic sugar for
$\exists \pi' \in X.\, \Box(\pi' =_{AP} \pi)$, i.e., $\pi$ is in $X$ if there
exists some trace in $X$ that agrees with $\pi$ on all propositions. Note that
we can only use $\pi \triangleright X$ outside of the scope of any temporal
operators; this ensures that we can bring the resulting formula into a form that
conforms to the Hyper²LTL syntax.


3.2 Hyper²LTL_fp


The semantics of Hyper²LTL quantifies over arbitrary sets of traces, making
even approximations to its semantics challenging. We propose Hyper²LTL_fp as a
restriction that only quantifies over sets that are subject to an additional
minimality or maximality constraint. For large classes of formulas, we show that
this admits effective model-checking approximations. We define Hyper²LTL_fp by
the following grammar:

$$\varphi := \mathbb{Q} \pi \in X.\, \varphi \mid \mathbb{Q}(X, \circledast, \varphi).\, \varphi \mid \psi$$
$$\psi := a_\pi \mid \neg \psi \mid \psi \land \psi \mid \bigcirc \psi \mid \psi\, \mathcal{U}\, \psi$$

where $a \in AP$, $\pi \in \mathcal{V}$, $X \in \mathfrak{V}$,
$\mathbb{Q} \in \{\forall, \exists\}$, and
$\circledast \in \{\curlywedge, \curlyvee\}$ determines if we consider smallest
($\curlyvee$) or largest ($\curlywedge$) sets. For example, the formula
$\exists (X, \curlyvee, \varphi_1).\, \varphi_2$ holds if there exists some set
of traces $X$ that satisfies both $\varphi_1$ and $\varphi_2$, and is a smallest
set that satisfies $\varphi_1$. Such minimality and maximality constraints with
respect to a (hyper)property arise naturally in many properties. Examples
include common knowledge (cf. Sect. 3.3), asynchronous hyperproperties
(cf. Sect. 4.2), and causality in reactive systems [22, 23].


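Semantics. For path formulas, the semantics of Hyper²LTL_fp is defined
analogously to that of Hyper²LTL and HyperLTL. For the quantifier prefix we
define:

$$\Pi, \Delta \models \psi \quad\text{iff}\quad \Pi \models \psi$$
$$\Pi, \Delta \models \mathbb{Q} \pi \in X.\, \varphi \quad\text{iff}\quad \mathbb{Q} t \in \Delta(X).\; \Pi[\pi \mapsto t], \Delta \models \varphi$$
$$\Pi, \Delta \models \mathbb{Q}(X, \circledast, \varphi_1).\, \varphi_2 \quad\text{iff}\quad \mathbb{Q} A \in sol(\Pi, \Delta, (X, \circledast, \varphi_1)).\; \Pi, \Delta[X \mapsto A] \models \varphi_2$$

where $sol(\Pi, \Delta, (X, \circledast, \varphi_1))$ denotes all solutions to
the minimality/maximality condition given by $\varphi_1$, which we define by
mutual recursion as follows:

$$sol(\Pi, \Delta, (X, \curlyvee, \varphi)) := \{A \subseteq \Sigma^\omega \mid \Pi, \Delta[X \mapsto A] \models \varphi \land \forall A' \subsetneq A.\; \Pi, \Delta[X \mapsto A'] \not\models \varphi\}$$
$$sol(\Pi, \Delta, (X, \curlywedge, \varphi)) := \{A \subseteq \Sigma^\omega \mid \Pi, \Delta[X \mapsto A] \models \varphi \land \forall A' \supsetneq A.\; \Pi, \Delta[X \mapsto A'] \not\models \varphi\}$$

A set $A$ satisfies the minimality/maximality constraint if it satisfies
$\varphi$ and is a least (in case $\circledast = \curlyvee$) or greatest (in
case $\circledast = \curlywedge$) set that satisfies $\varphi$.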
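
The definition of $sol$ can be explored by brute force on a finite universe.
The following Python sketch is only an illustration under strong assumptions:
the universe of "traces" is a small finite set of names, and the constraint
$\varphi$ is given as a Python predicate on sets rather than as a formula. The
names `sol_min` and `sol_max` are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List

TraceSet = FrozenSet[str]


def subsets(universe: Iterable[str]) -> List[TraceSet]:
    items = list(universe)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def sol_min(universe: Iterable[str],
            phi: Callable[[TraceSet], bool]) -> List[TraceSet]:
    """Sets satisfying phi such that no strict subset satisfies phi."""
    sat = [A for A in subsets(universe) if phi(A)]
    return [A for A in sat if not any(B < A for B in sat)]


def sol_max(universe: Iterable[str],
            phi: Callable[[TraceSet], bool]) -> List[TraceSet]:
    """Sets satisfying phi such that no strict superset satisfies phi."""
    sat = [A for A in subsets(universe) if phi(A)]
    return [A for A in sat if not any(B > A for B in sat)]


universe = {"t1", "t2", "t3"}
two_solutions = sol_min(universe, lambda A: "t1" in A or "t2" in A)
no_solution = sol_min(universe, lambda A: False)
# A closure-style constraint: contains t1, and t2 whenever it contains t1.
unique = sol_min(universe, lambda A: "t1" in A and ("t1" not in A or "t2" in A))
```

The first constraint has two incomparable minimal solutions and the second has
none, which is why the logic quantifies over the set of solutions; the
closure-style constraint has exactly one.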
Fig. 1. Left: An example for a multi-agent system with two agents, where agent 1
observes $a$ and $d$, and agent 2 observes $c$ and $d$. Right: The iterative
construction of the traces to be considered for common knowledge starting with
$a^n d^\omega$.


Note that $sol(\Pi, \Delta, (X, \circledast, \varphi))$ can contain multiple
sets or no set at all, i.e., there may not exist a unique least or greatest set
that satisfies $\varphi$. In Hyper²LTL_fp, we therefore add an additional
quantification over the set of all solutions to the minimality/maximality
constraint. When discussing our model-checking approximation algorithm, we
present a (syntactic) restriction on $\varphi$ which guarantees that
$sol(\Pi, \Delta, (X, \circledast, \varphi))$ contains a unique element (i.e.,
is a singleton set). Moreover, our restriction allows us to employ fixpoint
techniques to find approximations to this unique solution. In case the solution
for $(X, \circledast, \varphi)$ is unique, we often omit the leading quantifier
and simply write $(X, \circledast, \varphi)$ instead of
$\mathbb{Q}(X, \circledast, \varphi)$.

As we can encode the minimality/maximality constraints of Hyper²LTL_fp in
Hyper²LTL (see full version [11]), we have the following:


Proposition 1. Any Hyper²LTL_fp formula $\varphi$ can be effectively translated
into a Hyper²LTL formula $\varphi'$ such that for all transition systems
$\mathcal{T}$ we have $\mathcal{T} \models \varphi$ iff
$\mathcal{T} \models \varphi'$.


3.3 Common Knowledge in Multi-agent Systems 


To explain common knowledge, we use a variation of an example from [43], and
encode it in Hyper²LTL_fp. Fig. 1 (left) shows a transition system of a
distributed system with two agents, agent 1 and agent 2. Agent 1 observes
variables $a$ and $d$, whereas agent 2 observes $c$ and $d$. The property of
interest is whether, starting from the trace $\pi = a^n d^\omega$ for some fixed
$n > 1$, it is common knowledge for the two agents that $a$ holds in the second
step. It is trivial to see that $\bigcirc a$ holds on $\pi$. However, for common
knowledge, we consider the (possibly) infinite chain of observationally
equivalent traces. For example, agent 2 cannot distinguish the traces
$a^n d^\omega$ and $a^{n-1} b d^\omega$. Therefore, agent 2 only knows that
$\bigcirc a$ holds on $\pi$ if it also holds on $\pi' = a^{n-1} b d^\omega$. For
common knowledge, agent 1 also has to know that agent 2 knows $\bigcirc a$,
which means that for all traces that are indistinguishable from $\pi$ or $\pi'$
for agent 1, $\bigcirc a$ has to hold. This adds $\pi'' = a^{n-1} c d^\omega$ to
the set of traces to verify $\bigcirc a$ against. This chain of reasoning
continues as shown in Fig. 1 (right). In the last step we add
$a c^{n-1} d^\omega$ to the set of indistinguishable traces, concluding that
$\bigcirc a$ is not common knowledge.

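The following Hyper²LTL_fp formula specifies the property stated above. The
abbreviation
$obs(\pi_1, \pi_2) := \Box(\pi_1 =_{\{a,d\}} \pi_2) \lor \Box(\pi_1 =_{\{c,d\}} \pi_2)$
denotes that $\pi_1$ and $\pi_2$ are observationally equivalent for either
agent 1 or agent 2.

$$\forall \pi \in \mathfrak{S}.\; \Big( \bigwedge_{i=0}^{n-1} \bigcirc^i a_\pi \land \bigcirc^n \Box d_\pi \Big) \rightarrow$$
$$\Big( X, \curlyvee, \pi \triangleright X \land \big( \forall \pi_1 \in X.\, \forall \pi_2 \in \mathfrak{S}.\; obs(\pi_1, \pi_2) \rightarrow \pi_2 \triangleright X \big) \Big).\; \forall \pi' \in X.\, \bigcirc a_{\pi'}$$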


For a trace $\pi$ of the form $\pi = a^n d^\omega$, the set $X$ represents the
common knowledge set on $\pi$. This set $X$ is the smallest set that
(1) contains $\pi$ (expressed using our syntactic sugar $\triangleright$); and
(2) is closed under observations by either agent, i.e., if we find some
$\pi_1 \in X$ and some system trace $\pi_2$ that are observationally equivalent,
$\pi_2$ should also be in $X$. Note that this set is unique (due to the
minimality restriction), so we do not quantify it explicitly. Lastly, we require
that all traces in $X$ satisfy the property $\bigcirc a$. All sets that satisfy
this formula would also include the trace $a c^{n-1} d^\omega$, and therefore no
such $X$ exists; thus, we can conclude that starting from trace $a^n d^\omega$,
it is not common knowledge that $\bigcirc a$ holds. On the other hand, it is
common knowledge that $a$ holds in the first step (cf. Sect. 6).
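
The construction of the common knowledge set can be replayed on a small finite
example. The Python sketch below is an illustration only: the set of system
traces is hypothetical (chosen to mirror the chain described above for $n = 3$,
not taken from Fig. 1), each trace is written as a string with exactly one
proposition per position and an implicit $d^\omega$ suffix, and the helper names
are made up for this example.

```python
def project(trace: str, observable: str) -> str:
    """Keep only the propositions an agent observes; hide the rest as '_'."""
    return "".join(ch if ch in observable else "_" for ch in trace)


def obs(t1: str, t2: str) -> bool:
    """Agent 1 (observing a, d) or agent 2 (observing c, d) cannot tell them apart."""
    return (project(t1, "ad") == project(t2, "ad")
            or project(t1, "cd") == project(t2, "cd"))


def common_knowledge_set(start: str, system_traces: set) -> set:
    """Least set containing `start` and closed under obs (fixpoint iteration)."""
    X = {start}
    while True:
        new = {t2 for t1 in X for t2 in system_traces if obs(t1, t2)} - X
        if not new:
            return X
        X |= new


# Hypothetical system traces for n = 3 (each followed by d^omega).
system_traces = {"aaa", "aab", "aac", "abc", "acc"}
X = common_knowledge_set("aaa", system_traces)
a_in_first_step = all(t[0] == "a" for t in X)
a_in_second_step = all(t[1] == "a" for t in X)
```

In this toy instance the closure reaches the trace `acc`, so $a$ in the second
step is not common knowledge, whereas $a$ in the first step is.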


3.4 Hyper²LTL Model Checking


As Hyper²LTL and Hyper²LTL_fp allow quantification over arbitrary sets of
traces, we can encode the satisfiability of HyperQPTL (i.e., the question of
whether some set of traces satisfies a formula) within their model-checking
problem, rendering the model-checking problem highly undecidable [34], even for
very simple formulas [4].


Proposition 2. For any HyperQPTL formula $\varphi$ there exists a Hyper²LTL
formula $\varphi'$ such that $\varphi$ is satisfiable iff $\varphi'$ holds on
some arbitrary transition system. The model-checking problem of Hyper²LTL is
thus highly undecidable ($\Sigma^1_1$-hard).


Proof. Let $\varphi'$ be the Hyper²LTL formula obtained from $\varphi$ by
replacing each HyperQPTL trace quantifier $\mathbb{Q} \pi$ with the Hyper²LTL
quantifier $\mathbb{Q} \pi \in X$, and each propositional quantifier
$\mathbb{Q} q$ with $\mathbb{Q} \pi_q \in \mathfrak{A}$ for some fresh trace
variable $\pi_q$. In the body, we replace each propositional variable $q$ with
$a_{\pi_q}$ for some fixed proposition $a \in AP$. Then, $\varphi$ is
satisfiable iff the Hyper²LTL formula $\exists X.\, \varphi'$ holds in some
arbitrary system.


Hyper²LTL_fp cannot express HyperQPTL satisfiability directly. If there exists
a model of a HyperQPTL formula, there may not exist a least one. However, model
checking of Hyper²LTL_fp is also highly undecidable.


Proposition 3. The model-checking problem of Hyper²LTL_fp is $\Sigma^1_1$-hard.

Proof (Sketch). We can encode the existence of a recurrent computation of a
Turing machine, which is known to be $\Sigma^1_1$-hard [1].


Conversely, the existential fragment of Hyper²LTL can be encoded back into
HyperQPTL satisfiability:


Proposition 4. Let $\varphi$ be a Hyper²LTL formula that uses only existential
second-order quantification and $\mathcal{T}$ be any system. We can effectively
construct a formula $\varphi'$ in HyperQPTL such that
$\mathcal{T} \models \varphi$ iff $\varphi'$ is satisfiable.


Lastly, we present some easy fragments of Hyper²LTL for which the
model-checking problem is decidable. Here we write $\exists^* X$ (resp.
$\forall^* X$) for some sequence of existentially (resp. universally) quantified
second-order variables and $\exists^* \pi$ (resp. $\forall^* \pi$) for some
sequence of existentially (resp. universally) quantified first-order variables.
For example, $\exists^* X \forall^* \pi$ captures all formulas of the form
$\exists X_1, \ldots, X_n.\, \forall \pi_1, \ldots, \pi_m.\, \psi$ where $\psi$
is quantifier-free.


Proposition 5. The model-checking problem of Hyper²LTL is decidable for the
fragments: $\exists^* X \forall^* \pi$, $\forall^* X \forall^* \pi$,
$\exists^* X \exists^* \pi$, $\forall^* X \exists^* \pi$, and
$\exists X.\, \exists^* \pi \in X\, \forall^* \pi' \in X$.


We refer the reader to the full version [11] for detailed proofs. 


4 Expressiveness of Hyper²LTL


In this section, we point to existing logics that can naturally be encoded within
our second-order hyperlogics Hyper²LTL and Hyper²LTL_fp.


4.1 Hyper²LTL and LTL_{K,C}


LTL_K extends LTL with the knowledge operator $K$. For some subset of agents
$A$, the formula $K_A \psi$ holds in timestep $i$ if $\psi$ holds on all traces
that are equivalent for some agent in $A$ up to timestep $i$. See full version
[11] for detailed semantics. LTL_K and HyperCTL* have incomparable
expressiveness [16] but the knowledge operator $K$ can be encoded by either
adding a linear past operator [16] or by adding propositional quantification
(as in HyperQPTL) [45].

Using Hyper²LTL_fp we can encode LTL_{K,C}, featuring the knowledge operator
$K$ and the common knowledge operator $C$ (which requires that $\psi$ holds on
the closure set of equivalent traces, up to the current timepoint) [41]. Note
that LTL_{K,C} is not encodable by only adding propositional quantification or
the linear past operator.


Proposition 6. For every LTL_{K,C} formula $\varphi$ there exists a Hyper²LTL_fp
formula $\varphi'$ such that for any system $\mathcal{T}$ we have
$\mathcal{T} \models_{LTL_{K,C}} \varphi$ iff $\mathcal{T} \models \varphi'$.
Proof (Sketch). We follow the intuition discussed in Sect. 3.3. For each
occurrence of a knowledge operator in $\{K, C\}$, we use a fresh trace variable
to keep track of the points in time with respect to which we need to compare
traces. We then use this trace variable to introduce a second-order set that
collects all equivalent traces (by the observations of one agent, or the closure
of all agents' observations). We then inductively construct a Hyper²LTL_fp
formula that captures all the knowledge and common-knowledge sets, over which we
check the properties at hand. See full version for more details [11].


4.2 Hyper²LTL and Asynchronous Hyperproperties


Most existing hyperlogics (including Hyper²LTL) traverse the traces of a system
synchronously. However, in many cases such a synchronous traversal is too
restrictive and we need to compare traces asynchronously. As an example,
consider observational determinism (OD), which we can express in HyperLTL as
$\varphi_{OD} := \forall \pi_1.\, \forall \pi_2.\, \Box(o_{\pi_1} \leftrightarrow o_{\pi_2})$.
The formula states that the output of a system is identical across all traces
and so (trivially) no information about high-security inputs is leaked. In most
systems encountered in practice, this synchronous formula is violated, as the
exact timing between updates to $o$ might differ by a few steps (we provide some
examples in the full version [11]). However, assuming that an attacker only has
access to the memory footprint and not a timing channel, we would only like to
check that all traces are stutter equivalent (with respect to $o$).

A range of extensions to existing hyperlogics has been proposed to reason
about such asynchronous hyperproperties [3, 5, 9, 17, 39]. We consider AHLTL [3].
An AHLTL formula has the form
$\mathbb{Q}_1 \pi_1, \ldots, \mathbb{Q}_n \pi_n.\, \mathbf{E}.\, \psi$ where
$\psi$ is a quantifier-free HyperLTL formula. The initial trace quantifier
prefix is handled as in HyperLTL. However, different from HyperLTL, a trace
assignment $[\pi_1 \mapsto t_1, \ldots, \pi_n \mapsto t_n]$ satisfies
$\mathbf{E}.\, \psi$ if there exist stuttered traces $t'_1, \ldots, t'_n$ of
$t_1, \ldots, t_n$ such that
$[\pi_1 \mapsto t'_1, \ldots, \pi_n \mapsto t'_n] \models \psi$. We write
$\mathcal{T} \models_{AHLTL} \varphi$ if a system $\mathcal{T}$ satisfies the
AHLTL formula $\varphi$. Using this quantification over stutterings we can, for
example, express an asynchronous version of observational determinism as
$\forall \pi_1.\, \forall \pi_2.\, \mathbf{E}.\, \Box(o_{\pi_1} \leftrightarrow o_{\pi_2})$,
stating that every two traces can be aligned such that they (globally) agree on
$o$. Despite the fact that Hyper²LTL_fp is itself synchronous, we can use
second-order quantification to encode asynchronous hyperproperties, as we state
in the following proposition.
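
The difference between the synchronous and the asynchronous reading of
observational determinism can be seen on finite output sequences. The Python
sketch below is a simplification under stated assumptions: it compares finite
prefixes of the values of $o$ (not infinite traces), and it checks stutter
equivalence by collapsing consecutive repetitions; the function names are
hypothetical.

```python
from itertools import groupby
from typing import List


def destutter(outputs: List[int]) -> List[int]:
    """Collapse consecutive repetitions, e.g. [0, 0, 1, 1, 0] -> [0, 1, 0]."""
    return [value for value, _ in groupby(outputs)]


def synchronously_equal(t1: List[int], t2: List[int]) -> bool:
    """The outputs agree at every position (synchronous comparison)."""
    return t1 == t2


def stutter_equivalent(t1: List[int], t2: List[int]) -> bool:
    """The outputs agree after removing stuttering steps."""
    return destutter(t1) == destutter(t2)


# Two hypothetical output sequences that update o at different times.
t1 = [0, 0, 1, 1, 0]
t2 = [0, 1, 0, 0, 0]
sync_ok = synchronously_equal(t1, t2)
async_ok = stutter_equivalent(t1, t2)
```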


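Proposition 7. For any AHLTL formula $\varphi$ there exists a Hyper²LTL_fp
formula $\varphi'$ such that for any system $\mathcal{T}$ we have
$\mathcal{T} \models_{AHLTL} \varphi$ iff $\mathcal{T} \models \varphi'$.


Proof. Assume that
$\varphi = \mathbb{Q}_1 \pi_1, \ldots, \mathbb{Q}_n \pi_n.\, \mathbf{E}.\, \psi$
is the given AHLTL formula. For each $i \in [n]$ we define a formula
$\varphi_i$ as follows

$$\forall \pi_1 \in X_i.\, \forall \pi_2 \in \mathfrak{A}.\; \Big( (\pi_1 =_{AP} \pi_2)\; \mathcal{U}\; \Big( (\pi_1 =_{AP} \pi_2) \land \Box \bigwedge_{a \in AP} (a_{\pi_1} \leftrightarrow \bigcirc a_{\pi_2}) \Big) \Big) \rightarrow \pi_2 \triangleright X_i$$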
The formula asserts that the set of traces bound to $X_i$ is closed under
stuttering, i.e., if we start from any trace in $X_i$ and stutter it once (at
some arbitrary position) we again end up in $X_i$. Using the formulas
$\varphi_i$, we then construct a Hyper²LTL_fp formula that is equivalent to
$\varphi$ as follows

$$\varphi' := \mathbb{Q}_1 \pi_1 \in \mathfrak{S}, \ldots, \mathbb{Q}_n \pi_n \in \mathfrak{S}.\; (X_1, \curlyvee, \pi_1 \triangleright X_1 \land \varphi_1) \cdots (X_n, \curlyvee, \pi_n \triangleright X_n \land \varphi_n)$$
$$\exists \pi'_1 \in X_1, \ldots, \exists \pi'_n \in X_n.\; \psi[\pi'_1/\pi_1, \ldots, \pi'_n/\pi_n]$$

We first mimic the quantification in $\varphi$ and, for each trace $\pi_i$,
construct a least set $X_i$ that contains $\pi_i$ and is closed under stuttering
(thus describing exactly the set of all stutterings of $\pi_i$). Finally, we
assert that there are traces $\pi'_1, \ldots, \pi'_n$ with $\pi'_i \in X_i$ (so
$\pi'_i$ is a stuttering of $\pi_i$) such that $\pi'_1, \ldots, \pi'_n$ satisfy
$\psi$. It is easy to see that $\mathcal{T} \models_{AHLTL} \varphi$ iff
$\mathcal{T} \models \varphi'$ holds for all systems.


Hyper²LTL_fp captures all properties expressible in AHLTL. In particular, our
approximate model-checking algorithm for Hyper²LTL_fp (cf. Sect. 5) is
applicable to AHLTL, even for instances where no approximate solutions were
previously known. In Sect. 6, we show that our prototype model checker for
Hyper²LTL_fp can verify asynchronous properties in practice.


5 Model-Checking Hyper²LTL_fp


In general, finite-state model checking of Hyper²LTL_fp is highly undecidable
(cf. Proposition 2). In this section, we outline a partial algorithm that
computes approximations on the concrete values of second-order variables for a
fragment of Hyper²LTL_fp. At a very high level, our algorithm (Algorithm 1)
iteratively computes under- and overapproximations for second-order variables.
It then turns to resolve first-order quantification, using techniques from
HyperLTL model checking [8, 32], and resolves existential and universal trace
quantification on the under- and overapproximation of the second-order
variables, respectively. If the verification fails, it goes back to refine
second-order approximations.

In this section, we focus on the setting where we are interested in the least
sets (using $\curlyvee$), and use techniques to approximate the least fixpoint.
A similar (dual) treatment is possible for Hyper²LTL_fp formulas that use the
largest set. Every Hyper²LTL_fp formula which uses only minimal sets has the
following form:

$$\varphi = \gamma_1.\, (Y_1, \curlyvee, \varphi_1^{con}).\, \gamma_2 \ldots (Y_k, \curlyvee, \varphi_k^{con}).\, \gamma_{k+1}.\, \psi \qquad (1)$$

We quantify second-order variables $Y_1, \ldots, Y_k$, where, for each
$j \in [k]$, $Y_j$ is the least set that satisfies $\varphi_j^{con}$. Finally,
for each $j \in [k + 1]$,

$$\gamma_j = \mathbb{Q}_{l_{j-1}+1} \pi_{l_{j-1}+1} \in X_{l_{j-1}+1} \cdots \mathbb{Q}_{l_j} \pi_{l_j} \in X_{l_j}$$

is the block of first-order quantifiers that sits between the quantification of
$Y_{j-1}$ and $Y_j$. Here
$X_{l_{j-1}+1}, \ldots, X_{l_j} \in \{\mathfrak{S}, \mathfrak{A}, Y_1, \ldots, Y_{j-1}\}$
are second-order variables that are quantified before $\gamma_j$. In particular,
$\pi_1, \ldots, \pi_{l_j}$ are the first-order variables quantified before $Y_j$.
5.1 Fixpoints in Hyper²LTL_fp


We consider a fragment of Hyper²LTL_fp which we call the least fixpoint
fragment. Within this fragment, we restrict the formulas
$\varphi_1^{con}, \ldots, \varphi_k^{con}$ such that $Y_1, \ldots, Y_k$ can be
approximated as (least) fixpoints. Concretely, we say that $\varphi$ is in the
least fixpoint fragment of Hyper²LTL_fp if for all $j \in [k]$,
$\varphi_j^{con}$ is a conjunction of formulas of the form

$$\forall \dot{\pi}_1 \in X_1 \ldots \forall \dot{\pi}_n \in X_n.\; \psi_{step} \rightarrow \dot{\pi}_M \triangleright Y_j \qquad (2)$$

where each $X_i \in \{\mathfrak{S}, \mathfrak{A}, Y_1, \ldots, Y_j\}$,
$\psi_{step}$ is a quantifier-free formula over trace variables
$\dot{\pi}_1, \ldots, \dot{\pi}_n, \pi_1, \ldots, \pi_{l_j}$, and $M \in [n]$.
Intuitively, Eq. (2) states a requirement on traces that should be included in
$Y_j$. If we find traces $\dot{t}_1 \in X_1, \ldots, \dot{t}_n \in X_n$ that,
together with the traces $t_1, \ldots, t_{l_j}$ quantified before $Y_j$, satisfy
$\psi_{step}$, then $\dot{t}_M$ should be included in $Y_j$.

Together with the minimality constraint on $Y_j$ (stemming from the semantics
of Hyper²LTL_fp), this effectively defines a (monotone) least fixpoint
computation, as $\psi_{step}$ defines exactly the traces to be added to the set.
This will allow us to use results from fixpoint theory to compute approximations
for the sets $Y_j$.
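
When the least fixpoint is infinite, stopping the iteration after a fixed number
of rounds still yields a sound underapproximation. The Python sketch below
illustrates this for the stuttering closure of Sect. 4.2 on finite words; the
restriction to finite words, the helper `stutter_once`, and the function
`under_approx` are assumptions of this example and are not the implementation
used in the tool.

```python
from typing import Callable, Set


def stutter_once(word: str) -> Set[str]:
    """All words obtained by repeating exactly one position of `word`."""
    return {word[:i + 1] + word[i] + word[i + 1:] for i in range(len(word))}


def under_approx(seed: Set[str],
                 step: Callable[[str], Set[str]],
                 rounds: int) -> Set[str]:
    """Run `rounds` steps of the monotone iteration Y := Y ∪ step(Y)."""
    Y = set(seed)
    for _ in range(rounds):
        Y |= {t2 for t in Y for t2 in step(t)}
    return Y


Y0 = under_approx({"ab"}, stutter_once, 0)
Y1 = under_approx({"ab"}, stutter_once, 1)
Y2 = under_approx({"ab"}, stutter_once, 2)
```

Each additional round can only add elements, so the sets form an increasing
chain of underapproximations of the (here infinite) least fixpoint.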

Our least fixpoint fragment captures most properties of interest, in particular,
common knowledge (Sect. 3.3) and asynchronous hyperproperties (Sect. 4.2). We
observe that formulas of the above form ensure that the solution $Y_j$ is
unique, i.e., for any trace assignment $\Pi$ to $\pi_1, \ldots, \pi_{l_j}$ and
second-order assignment $\Delta$ to
$\mathfrak{S}, \mathfrak{A}, Y_1, \ldots, Y_{j-1}$, there is only one element in
$sol(\Pi, \Delta, (Y_j, \curlyvee, \varphi_j^{con}))$.


5.2 Functions as Automata 


In our (approximate) model-checking algorithm, we represent a concrete
assignment to the second-order variables $Y_1, \ldots, Y_k$ using automata
$\mathcal{B}_{Y_1}, \ldots, \mathcal{B}_{Y_k}$. The concrete assignment of $Y_j$
can depend on traces assigned to $\pi_1, \ldots, \pi_{l_j}$, i.e., the
first-order variables quantified before $Y_j$. To capture these dependencies, we
view each $Y_j$ not as a set of traces but as a function mapping traces of all
preceding first-order variables to a set of traces. We represent such a function
$f : (\Sigma^\omega)^{l_j} \to 2^{(\Sigma^\omega)}$ mapping the $l_j$ traces to a
set of traces as an automaton $\mathcal{A}$ over $\Sigma^{l_j+1}$. For traces
$t_1, \ldots, t_{l_j}$, the set $f(t_1, \ldots, t_{l_j})$ is represented in the
automaton by the set
$\{t \in \Sigma^\omega \mid zip(t_1, \ldots, t_{l_j}, t) \in \mathcal{L}(\mathcal{A})\}$.
For example, the function $f(t_1) := \{t_1\}$ can be defined by the automaton
that accepts the zipping of a pair of traces exactly if both traces agree on all
propositions. This representation of functions as automata allows us to maintain
an assignment to $Y_j$ that is parametric in $\pi_1, \ldots, \pi_{l_j}$, and
still allows first-order model checking on $Y_1, \ldots, Y_k$.
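
The example function $f(t_1) := \{t_1\}$ can be sketched as a one-state
automaton reading pairs of letters. The Python code below is a simplified
illustration: it runs a deterministic safety automaton on finite prefixes of the
zipped traces (rather than a Büchi automaton on infinite words), and all names
are hypothetical.

```python
from typing import FrozenSet, List, Optional, Tuple

Letter = FrozenSet[str]


def identity_step(state: str, pair: Tuple[Letter, Letter]) -> Optional[str]:
    """One-state automaton over Sigma^2: stay while both letters agree."""
    x, y = pair
    return state if x == y else None  # None: no transition, the run dies


def in_image(t1: List[Letter], t: List[Letter]) -> bool:
    """Check (on finite prefixes) whether zip(t1, t) is accepted, i.e. t in f(t1)."""
    state: Optional[str] = "q"
    for pair in zip(t1, t):
        state = identity_step(state, pair)
        if state is None:
            return False
    return True


A, B, D = frozenset({"a"}), frozenset({"b"}), frozenset({"d"})
same = in_image([A, D, D], [A, D, D])
different = in_image([A, D, D], [A, B, D])
```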


5.3 Model Checking for First-Order Quantification 


First, we focus on first-order quantification, and assume that we are given a con- 
crete assignment for each second-order variable as fixed automata By,,..., By, 
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(where By, is an automaton over 3i+1), Our construction for resolving first- 
order quantification is based on HyperLTL model checking [32], but needs to 
work on sets of traces that, themselves, are based on traces quantified before 
(cf. Sect. 5.2). Recall that the first-order quantifier prefix is y1 +++ Yk+1 = Qum E 
Xite Qa Tar © Xt,4,. For each 1 < i < [p41 we inductively construct an 
automaton A; over X*~! that summarizes all trace assignments to 7,...,7j—1 
that satisfy the subformula starting with the quantification of 7;. That is, for all 
traces t),...,t;-1; we have 


[mi A tty eS Go) F Qiri SA OD Tia E Xe 


(under the fixed second-order assignment for Y),..., Yp given by By,,..., By,) if 
and only if zip(ti,...,ti-1) E€ £(A;). In the context of HyperLTL model check- 
ing we say A; is equivalent to Qimi € Xi- >: Qhyai Mgr E Xing. [8,32]. In 
particular, A; is an automaton over singleton alphabet X°. 

We construct $A_1, \ldots, A_{l_{k+1}+1}$ inductively, starting with
$A_{l_{k+1}+1}$. Initially, we construct $A_{l_{k+1}+1}$ (over $\Sigma^{l_{k+1}}$)
using a standard LTL-to-NBA construction on the (quantifier-free) body $\psi$
(see [32] for details). Now assume that we are given an (inductively
constructed) automaton $A_{i+1}$ over $\Sigma^{i}$ and want to construct $A_i$.
We first consider the case where $Q_i = \exists$, i.e., the $i$th trace
quantification is existential. Now $X_i$ (the set on which $\pi_i$ is resolved)
equals $\mathfrak{S}$, $\mathfrak{A}$, or $Y_j$ for some $j \in [k]$. In each
case, we represent the current assignment to $X_i$ as an automaton $C$ over
$\Sigma^{T+1}$ for some $T < i$ that defines the model of $X_i$ based on traces
$\pi_1, \ldots, \pi_T$: In case $X_i = \mathfrak{S}$, we set $C$ to be the
automaton over $\Sigma^{0+1}$ that accepts exactly the traces in the given
system $\mathcal{T}$; in case $X_i = \mathfrak{A}$, we set $C$ to be the
automaton over $\Sigma^{0+1}$ that accepts all traces; if $X_i = Y_j$ for some
$j \in [k]$, we set $C$ to be $B_{Y_j}$ (which is an automaton over
$\Sigma^{l_j+1}$).¹ Given $C$, we can now modify the construction from [32] to
resolve first-order quantification: The desired automaton $A_i$ should accept
the zipping of traces $t_1, \ldots, t_{i-1}$ if there exists a trace $t$ such
that (1) $\mathit{zip}(t_1, \ldots, t_{i-1}, t) \in \mathcal{L}(A_{i+1})$, and
(2) the trace $t$ is contained in the set of traces assigned to $X_i$ as given
by $C$, i.e., $\mathit{zip}(t_1, \ldots, t_T, t) \in \mathcal{L}(C)$. The
construction of this automaton is straightforward by taking a product of
$A_{i+1}$ and $C$. We denote this automaton with eProduct($A_{i+1}$, $C$). In
case $Q_i = \forall$ we exploit the duality
$\forall \pi.\ \psi = \neg \exists \pi.\ \neg \psi$, combining the above
construction with automata complementation. We denote this universal product of
$A_{i+1}$ and $C$ with uProduct($A_{i+1}$, $C$).
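
The two products can be illustrated on finite sets of tuples instead of Büchi automata. The following is a minimal sketch under that simplifying assumption: the names `e_product`, `u_product`, and `complement` are hypothetical, a "trace" is just a label, and set difference over a finite universe stands in for automata complementation.

```python
from itertools import product


def e_product(a_next, c, bound):
    """Existential step on finite sets of tuples.

    a_next: set of tuples (t_1, ..., t_{i-1}, t), standing in for L(A_{i+1}).
    c:      set of tuples (t_1, ..., t_bound, t), standing in for L(C).
    A prefix is kept if some witness t extends it into a_next and lies in c.
    """
    return {p[:-1] for p in a_next if p[:bound] + (p[-1],) in c}


def complement(lang, traces, length):
    """Complement of lang within all tuples of the given length."""
    return set(product(traces, repeat=length)) - set(lang)


def u_product(a_next, c, bound, traces, length):
    """Universal step via the duality (forall = not exists not).

    length is the tuple length of a_next (i.e., i).
    """
    not_a = complement(a_next, traces, length)
    return complement(e_product(not_a, c, bound), traces, length - 1)


traces = ["x", "y"]                        # hypothetical finite trace names
c_all = {("x",), ("y",)}                   # bound = 0: every trace is allowed
a2 = {("x", "x"), ("x", "y"), ("y", "x")}  # stand-in for L(A_{i+1}) with i = 2
exists_result = e_product(a2, c_all, 0)
forall_result = u_product(a2, c_all, 0, traces, 2)
```

Here the prefix `("y",)` survives the existential step (witness `"x"`) but not the universal one, because `("y", "y")` is missing from `a2`.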

The final automaton $A_1$ is an automaton over the singleton alphabet $\Sigma^0$
that is equivalent to $\gamma_1 \cdots \gamma_{k+1}.\ \psi$, i.e., the formula
with its entire first-order quantifier prefix. Automaton $A_1$ thus satisfies
$\mathcal{L}(A_1) \neq \emptyset$ (which we can decide) iff the empty trace
assignment satisfies the first-order formula
$\gamma_1 \cdots \gamma_{k+1}.\ \psi$, iff $\varphi$ (of Eq. (1)) holds within the
fixed model for $Y_1, \ldots, Y_k$. For a given fixed second-order assignment
(given as automata $B_{Y_1}, \ldots, B_{Y_k}$), we can thus decide if the system
satisfies the first-order part.


1 Note that in this case $l_j < i$: if trace $\pi_i$ is resolved on $Y_j$ (i.e., $X_i = Y_j$), then $Y_j$
must be quantified before $\pi_i$, so there are at most $i - 1$ traces quantified before $Y_j$.

Algorithm 1

verify($\varphi$, $\mathcal{T}$) =
    let $\varphi = [\gamma_j\, (Y_j, \curlyvee, \varphi_j^{con})]_{j=1}^{k}\ \gamma_{k+1}.\ \psi$ where $\gamma_i = [Q_m \pi_m \in X_m]_{m=l_{i-1}+1}^{l_i}$
    let $N = 0$
    let $A_{\mathcal{T}}$ = systemToNBA($\mathcal{T}$)
    repeat
        // Start outside-in traversal on second-order variables
        let $b = [\mathfrak{S} \mapsto (A_{\mathcal{T}}, A_{\mathcal{T}}),\ \mathfrak{A} \mapsto (A_\top, A_\top)]$
        for $j$ from 1 to $k$ do
            $B_j^l$ := underApprox($(Y_j, \curlyvee, \varphi_j^{con})$, $b$, $N$)
            $B_j^u$ := overApprox($(Y_j, \curlyvee, \varphi_j^{con})$, $b$, $N$)
            $b(Y_j)$ := $(B_j^l, B_j^u)$
        // Start inside-out traversal on first-order variables
        let $A_{l_{k+1}+1}$ = LTLtoNBA($\psi$)
        for $i$ from $l_{k+1}$ to 1 do
            let $(C^l, C^u) = b(X_i)$
            if $Q_i = \exists$ then
                $A_i$ := eProduct($A_{i+1}$, $C^l$)
            else
                $A_i$ := uProduct($A_{i+1}$, $C^u$)
        if $\mathcal{L}(A_1) \neq \emptyset$ then
            return SAT
        else
            $N = N + 1$

During the first-order model-checking phase, each quantifier alternation in
the formula requires complex automata complementation. For the first-order
phase, we could also use cheaper approximate methods by, e.g., instantiating
the existential trace using a strategy [6,7,25].


5.4 Bidirectional Model Checking 


So far, we have discussed the verification of the first-order quantifiers assuming
we have a fixed model for all second-order variables $Y_1, \ldots, Y_k$. In our
actual model-checking algorithm, we instead maintain under- and
overapproximations on each of the $Y_1, \ldots, Y_k$.

In each iteration, we first traverse the second-order quantifiers in an
outside-in direction and compute lower and upper bounds on each $Y_j$. Given the
bounds, we then traverse the first-order prefix in an inside-out direction using
the current approximations to $Y_1, \ldots, Y_k$. If the current approximations
are not precise enough to witness the satisfaction (or violation) of a property,
we repeat and try to compute better bounds on $Y_1, \ldots, Y_k$. Due to the
different directions of traversal, we refer to our model-checking approach as
bidirectional. Algorithm 1 provides an overview. Initially, we convert the system
$\mathcal{T}$ to an NBA $A_{\mathcal{T}}$ accepting exactly the traces of the
system. In each round, we compute under- and overapproximations for each $Y_j$
in a mapping $b$. We initialize $b$ by mapping $\mathfrak{S}$ to
$(A_{\mathcal{T}}, A_{\mathcal{T}})$ (i.e., the value assigned to the system
variable is precisely $A_{\mathcal{T}}$ for both under- and overapproximation),
and $\mathfrak{A}$ to $(A_\top, A_\top)$ where $A_\top$ is an automaton over
$\Sigma^1$ accepting all traces. We then traverse the second-order quantifiers
outside-in (from $Y_1$ to $Y_k$) and for each $Y_j$ compute a pair
$(B_j^l, B_j^u)$ of automata over $\Sigma^{l_j+1}$ that under- and
overapproximate the actual (unique) model of $Y_j$. We compute these
approximations using functions underApprox and overApprox, which can be
instantiated with any procedure that computes sound lower and upper bounds
(see Sect. 5.5). During verification, we further maintain a precision bound $N$
(initially set to 0) that tracks the current precision of the second-order
approximations.

When $b$ contains an under- and overapproximation for each second-order
variable, we traverse the first-order variables in an inside-out direction (from
$\pi_{l_{k+1}}$ to $\pi_1$) and, following the construction outlined in
Sect. 5.3, construct automata $A_{l_{k+1}+1}, \ldots, A_1$. Different from the
simplified setting in Sect. 5.3 (where we assume a fixed automaton $B_{Y_j}$
providing a model for each $Y_j$), the mapping $b$ contains only approximations
of the concrete solution. We choose which approximation to use according to the
corresponding set quantification: In case we construct $A_i$ and
$Q_i = \exists$, we use the underapproximation (thus making sure that any
witness trace we pick is indeed contained in the actual model of the
second-order variable); and if $Q_i = \forall$, we use the overapproximation
(making sure that we consider at least those traces that are in the actual
solution). If $\mathcal{L}(A_1)$ is non-empty, i.e., accepts the empty trace
assignment, the formula holds (assuming the approximations returned by
underApprox and overApprox are sound). If not, we increase the precision bound
$N$ and repeat.
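
The control flow of Algorithm 1 can be sketched as a small driver loop. This is only an illustration under stated assumptions: all automata operations are passed in as hypothetical callables (`under_approx`, `over_approx`, `e_product`, `u_product`, `is_nonempty`), and a `max_rounds` cut-off is added here so that the sketch always terminates, which the algorithm as described above does not have.

```python
def verify(quantifiers, second_order, under_approx, over_approx,
           body_automaton, e_product, u_product, is_nonempty,
           initial_bounds, max_rounds=10):
    """Bidirectional loop in the spirit of Algorithm 1 (hypothetical interface).

    quantifiers:    list of (kind, set_name) pairs, outermost first,
                    with kind in {"exists", "forall"}.
    second_order:   list of (name, constraint) pairs, outermost first.
    initial_bounds: mapping for the system and all-traces variables,
                    each to a (lower, upper) pair.
    """
    for n in range(max_rounds):
        bounds = dict(initial_bounds)
        # Outside-in traversal of the second-order variables.
        for name, constraint in second_order:
            lower = under_approx(name, constraint, bounds, n)
            upper = over_approx(name, constraint, bounds, n)
            bounds[name] = (lower, upper)
        # Inside-out traversal of the first-order quantifiers.
        automaton = body_automaton
        for kind, set_name in reversed(quantifiers):
            lower, upper = bounds[set_name]
            if kind == "exists":
                automaton = e_product(automaton, lower)
            else:
                automaton = u_product(automaton, upper)
        if is_nonempty(automaton):
            return "SAT", n
    return "UNKNOWN", max_rounds
```

Passing the operations as parameters keeps the loop independent of any particular automata library; existential quantifiers consume the lower bound and universal ones the upper bound, as in the description above.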

In Algorithm 1, we only check for the satisfaction of a formula (to keep the 
notation succinct). Using the second-order approximations in b we can also check 
the negation of a formula (by considering the negated body and dualizing all 
trace quantifiers). Our tool (Sect. 6) makes use of this and thus simultaneously 
tries to show satisfaction and violation of a formula. 


5.5 Computing Under- and Overapproximations 


In this section we provide concrete instantiations for underApprox and overApprox. 


Computing Underapproximations. As we consider the fixpoint fragment,
each formula $\varphi_j^{con}$ (defining $Y_j$) is a conjunction of formulas of
the form in Eq. (2), thus defining $Y_j$ via a least fixpoint computation. For
simplicity, we assume that $Y_j$ is defined by a single conjunct, given by
Eq. (2) (our construction generalizes easily to a conjunction of such formulas).
Assuming fixed models for $\mathfrak{S}$, $\mathfrak{A}$ and
$Y_1, \ldots, Y_{j-1}$, the fixpoint operation defining $Y_j$ is monotone, i.e.,
the larger the current model for $Y_j$ is, the more traces we need to add
according to Eq. (2). Monotonicity allows us to apply the Knaster-Tarski
theorem [47] and compute underapproximations to the fixpoint by iteration.

In our construction of an approximation for $Y_j$, we are given a mapping $b$
that fixes a pair of automata for $\mathfrak{S}$, $\mathfrak{A}$, and
$Y_1, \ldots, Y_{j-1}$ (due to the outside-in traversal in Algorithm 1). As we
are computing an underapproximation, we use the underapproximation for each of
the second-order variables in $b$. So $b(\mathfrak{S})$ and $b(\mathfrak{A})$
are automata over $\Sigma^1$ and for each $j' \in [j-1]$, $b(Y_{j'})$ is an
automaton over $\Sigma^{l_{j'}+1}$. Given this fixed mapping $b$, we iteratively
construct automata $\hat{C}_0, \hat{C}_1, \ldots$ over $\Sigma^{l_j+1}$ that
capture (increasingly precise) underapproximations on the solution for $Y_j$.
We set $\hat{C}_0$ to be the automaton with the empty language. We then
recursively define $\hat{C}_{N+1}$ based on $\hat{C}_N$ as follows: For each
second-order variable $X_i$ for $i \in [n]$ used in Eq. (2) we can assume a
concrete assignment in the form of an automaton $D_i$ over $\Sigma^{T_i+1}$ for
some $T_i \le l_j$: In case $X_i \neq Y_j$ (so
$X_i \in \{\mathfrak{S}, \mathfrak{A}, Y_1, \ldots, Y_{j-1}\}$), we set
$D_i := b(X_i)$. In case $X_i = Y_j$, we set $D_i := \hat{C}_N$, i.e., we use the
current approximation of $Y_j$ in iteration $N$. After we have set
$D_1, \ldots, D_n$, we compute an automaton $\dot{C}$ over $\Sigma^{l_j+1}$ that
accepts $\mathit{zip}(t_1, \ldots, t_{l_j}, t)$ iff there exist traces
$\dot{t}_1, \ldots, \dot{t}_n$ such that
(1) $\mathit{zip}(t_1, \ldots, t_{T_i}, \dot{t}_i) \in \mathcal{L}(D_i)$ for all
$i \in [n]$,
(2) $[\pi_1 \mapsto t_1, \ldots, \pi_{l_j} \mapsto t_{l_j}, \dot{\pi}_1 \mapsto \dot{t}_1, \ldots, \dot{\pi}_n \mapsto \dot{t}_n] \models \psi_{step}$,
and (3) trace $t$ equals $\dot{t}_M$ (of Eq. (2)). The intuition is that
$\dot{C}$ captures all traces that should be added to $Y_j$: Given
$t_1, \ldots, t_{l_j}$ we check if there are traces
$\dot{t}_1, \ldots, \dot{t}_n$ for trace variables
$\dot{\pi}_1, \ldots, \dot{\pi}_n$ in Eq. (2) where (1) each $\dot{t}_i$ is in
the assignment for $X_i$, which is captured by the automaton $D_i$ over
$\Sigma^{T_i+1}$, and (2) the traces $\dot{t}_1, \ldots, \dot{t}_n$ satisfy
$\psi_{step}$. If this is the case, we want to add $\dot{t}_M$ (as stated in
Eq. (2)). We then define $\hat{C}_{N+1}$ as the union of $\hat{C}_N$ and
$\dot{C}$, i.e., we extend the previous model with all (potentially new) traces
that need to be added.
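
The iteration can be mimicked on plain finite sets. The sketch below assumes a hypothetical monotone `step` function that returns the elements to be added for a given current model; it is a finite-set analogue of the automata construction, not the construction itself.

```python
def step(current):
    """Hypothetical monotone step: seed 0, then add x + 1 while x + 1 <= 3."""
    return {0} | {x + 1 for x in current if x + 1 <= 3}


def under_approximations(step_fn, rounds):
    """Return [C_0, C_1, ..., C_rounds] with C_0 = {} and
    C_{N+1} = C_N united with step_fn(C_N)."""
    current = frozenset()
    chain = [current]
    for _ in range(rounds):
        current = current | frozenset(step_fn(current))
        chain.append(current)
    return chain


chain = under_approximations(step, 5)
```

Every element of `chain` is a sound lower bound on the least fixpoint, and the chain is increasing; in this toy instance it stabilizes after four rounds.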


Computing Overapproximations. As we noted above, conditions of the form
of Eq. (2) always define fixpoint constraints. To compute upper bounds on such
fixpoint constructions we make use of Park's theorem [48], stating that if we
find some set (or automaton) $B$ that is inductive (i.e., when computing all
traces that we would need to add assuming the current model of $Y_j$ is $B$, we
end up with traces that are already in $B$), then $B$ overapproximates the
unique solution (i.e., the least fixpoint) of $Y_j$. To derive such an inductive
invariant, we employ techniques developed in the context of regular model
checking [15] (see Sect. 7). Concretely, we employ the approach from [19] that
uses automata learning [2] to find suitable invariants. While the approach from
[19] is limited to finite words, we extend it to an $\omega$-setting by
interpreting an automaton accepting finite words as one that accepts an
$\omega$-word $u$ iff every prefix of $u$ is accepted.² As soon as the learner
provides a candidate for an equivalence check, we check that it is inductive
and, if not, provide some finite counterexample (see [19] for details). If the
automaton is inductive, we return it as a potential overapproximation. Should
this approximation not be precise enough, the first-order model checking
(Sect. 5.3) returns some concrete counterexample, i.e., some trace contained in
the invariant but violating the property, which we use to provide more
counterexamples to the learner.

2 This effectively poses the assumption that the step formula specifies a safety property,
which seems to be the case for almost all examples. As an example, common
knowledge infers a safety property: In each step, we add all traces for which there
exists some trace that agrees on all propositions observed by that agent.
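
The inductiveness check behind Park's theorem is easy to state on finite sets. The following sketch assumes a hypothetical monotone `step` function and plain Python sets in place of automata; it does not model the automata-learning loop of [19].

```python
def step(current):
    """Hypothetical monotone step: seed 0, then add x + 1 while x + 1 <= 3."""
    return {0} | {x + 1 for x in current if x + 1 <= 3}


def least_fixpoint(step_fn):
    """Iterate from the empty set until nothing new is added."""
    current = set()
    while True:
        extended = current | step_fn(current)
        if extended == current:
            return current
        current = extended


def inductiveness_counterexamples(candidate, step_fn):
    """Elements the step would add that the candidate misses.

    The result is empty iff the candidate is inductive.
    """
    return step_fn(candidate) - candidate


good = {0, 1, 2, 3, 7}   # inductive, hence an overapproximation
bad = {0, 1}             # not inductive: the step wants to add 2
```

An inductive candidate such as `good` contains the least fixpoint even though it also contains the spurious element 7; a failed check yields concrete elements that can be fed back as counterexamples.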


6 Implementation and Experiments 


We have implemented our model-checking algorithm in a prototype tool we call 
HySO (Hyperproperties with Second Order).³ Our tool uses spot [29] for basic
automata operations (such as LTL-to-NBA translations and complementations). 
To compute under- and overapproximations, we use the techniques described in 
Sect. 5.5. We evaluate the algorithm on the following benchmarks. 


Muddy Children. The muddy children puzzle [30] is one of the classic examples
in the common knowledge literature. The puzzle consists of n children standing
such that each child can see all other children's faces. Of the n children, an
unknown number k ≥ 1 have a muddy forehead, and in incremental rounds, the
children should step forward once they know whether their face is muddy or not.
Consider the scenario of n = 2 and k = 1, so child a sees that child b has a
muddy forehead and child b sees that a is clean. In this case, b immediately
steps forward, as it knows that its forehead is muddy since k ≥ 1. In the next
step, a knows that its face is clean since b stepped forward in round 1. In
general, one can prove that all children step forward in round k, deriving
common knowledge.
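
The reasoning in the puzzle can be replayed with a small possible-worlds simulation. This is a hedged sketch of the classic epistemic argument, not the transition-system encoding used by HySO; the function names are hypothetical, and the actual configuration must contain at least one muddy child.

```python
from itertools import product


def knows(world, child, worlds):
    """A child sees all other foreheads; it knows its own status iff every
    remaining world that matches this view agrees on the child's forehead."""
    n = len(world)
    views = {w[child] for w in worlds
             if all(w[j] == world[j] for j in range(n) if j != child)}
    return len(views) == 1


def first_round_known(actual):
    """For each child, the first round in which it knows its own status.

    actual is a tuple of booleans (True = muddy) with at least one True.
    """
    n = len(actual)
    worlds = {w for w in product([False, True], repeat=n) if any(w)}  # k >= 1
    first = [None] * n
    rnd = 0
    while None in first:
        rnd += 1
        observed = tuple(knows(actual, i, worlds) for i in range(n))
        for i in range(n):
            if observed[i] and first[i] is None:
                first[i] = rnd
        # Public observation: keep only worlds in which exactly the same
        # children would have known their status in this round.
        worlds = {w for w in worlds
                  if tuple(knows(w, i, worlds) for i in range(n)) == observed}
    return first
```

For the two-child scenario above, `first_round_known((False, True))` yields `[2, 1]`: the muddy child b knows in round 1 and the clean child a in round 2.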

For each n we construct a transition system $\mathcal{T}_n$ that encodes the
muddy children scenario with n children. For every m we design a
Hyper$^2$LTL$_{fp}$ formula $\varphi_m$ that adds to the common knowledge set X
all traces that appear indistinguishable in the first m steps for some child.
We then specify that all traces in X should agree on all inputs, asserting that
all inputs are common knowledge.⁴ We used HySO to fully automatically check
$\mathcal{T}_n$ against $\varphi_m$ for varying values of n and m, i.e., we
checked if, after the first m steps, the inputs of all children are common
knowledge. As expected, the above property holds only if m ≥ n (in the worst
case, where all children are dirty (k = n), the inputs of all children only
become common knowledge after n steps). We depict the results in Table 1a.


Asynchronous Hyperproperties. As we have shown in Sect. 4.2, we can
encode arbitrary AHLTL properties into Hyper$^2$LTL. We verified synchronous
and asynchronous versions of observational determinism (cf. Sect. 4.2) on
programs taken from [3,5,9]. We depict the verification results in Table 1b.
Recall that Hyper$^2$LTL$_{fp}$ properties without any second-order variables
correspond to


3 Our tool is publicly available at https://doi.org/10.5281/zenodo.7877144. 

4 This property is not expressible in non-hyper logics such as LTL$_{K,C}$, where we
can only check trace properties on the common knowledge set X. In contrast,
Hyper$^2$LTL$_{fp}$ allows us to check hyperproperties on X. That way, we can express
that some value is common knowledge (i.e., equal across all traces in the set) and
not only that a property is common knowledge (i.e., holds on all traces in the set).

Table 1. In Table 1a, we check common knowledge in the muddy children puzzle for
n children and m rounds. We give the result (✓ if common knowledge holds and ✗ if it
does not), and the running time. In Table 1b, we check synchronous and asynchronous
versions of observational determinism. We depict the number of iterations needed and
running time. Times are given in seconds.


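(a) Cells marked [?] could not be recovered.

          m = 1    m = 2    m = 3    m = 4
n = 2     ✗        ✓        [?]      [?]
          [?]      [?]      [?]      [?]
n = 3     ✗        ✗        ✓        [?]
          0.79     0.75     0.54     [?]
n = 4     ✗        ✗        ✗        ✓
          2.72     2.21     1.67     1.19

(b)

Instance                                        Method     Res   t
$\mathcal{T}_{syn}$, $\varphi_{OD}$             -          ✓     0.26
$\mathcal{T}_{asyn}$, $\varphi_{OD}$            -          ✗     0.31
$\mathcal{T}_{syn}$, $\varphi_{OD}^{asyn}$      Iter (0)   ✓     0.50
$\mathcal{T}_{asyn}$, $\varphi_{OD}^{asyn}$     Iter (1)   ✓     0.78
Q1, $\varphi_{OD}$                              -          ✗     0.34
Q1, $\varphi_{OD}^{asyn}$                       Iter (1)   ✓     0.86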


HyperQPTL formulas. HySO can check such properties precisely, i.e., it consti- 
tutes a sound-and-complete model checker for HyperQPTL properties with an 
arbitrary quantifier prefix. The synchronous version of observational determin- 
ism is a HyperLTL property and thus needs no second-order approximation (we 
set the method column to “-” in these cases). 


Common Knowledge in Multi-agent Systems. We used HySO for an automatic
analysis of the system in Fig. 1. Here, we verify that on initial trace
$\{a\}^n\{d\}^\omega$ it is CK that $a$ holds in the first step. We use a similar
formula as the one of Sect. 3.3, with the change that we are interested in
whether $a$ is CK (whereas we used $\bigcirc a$ in Sect. 3.3). As expected, HySO
requires 2n − 1 iterations to converge. We depict the results in Table 2a.


Mazurkiewicz Traces. Mazurkiewicz traces are an important concept in the
theory of distributed computing [27]. Let $I \subseteq \Sigma \times \Sigma$ be
an independence relation that determines when two consecutive letters can be
switched (think of two actions in disjoint processes in a distributed system).
Any $t \in \Sigma^\omega$ then defines the set of all traces that are equivalent
to $t$ by flipping consecutive independent actions an arbitrary number of times
(the equivalence class of all these traces is called the Mazurkiewicz trace).
See [27] for details. The verification problem for Mazurkiewicz traces now asks
if, given some $t \in \Sigma^\omega$, all traces in the Mazurkiewicz trace of
$t$ satisfy some property $\psi$. Using Hyper$^2$LTL$_{fp}$ we can directly
reason about the Mazurkiewicz trace of any given trace, by requiring that all
traces that are equal up to one swap of independent letters are also in a given
set (which is easily expressed in Hyper$^2$LTL$_{fp}$).

Table 2. In Table 2a, we check common knowledge in the example from Fig. 1 when
starting with $a^n d^\omega$ for varying values of n. We depict the number of refinement
iterations, the result, and the running time. In Table 2b, we verify various properties on
Mazurkiewicz traces. We depict whether the property could be verified or refuted by
iteration or automata learning, the result, and the time. Times are given in seconds.


(a)

n      Method       Res   t
1      Iter (1)     ✓     0.51
2      Iter (3)     ✓     0.83
3      Iter (5)     ✓     1.20
10     Iter (19)    ✓     3.81
100    Iter (199)   ✓     102.8

(b)

Instance               Method      Res   t
SwapA                  Learn       ✓     1.07
SwapATwice             Learn       ✓     2.13
SwapA$_5$              Iter (5)    ✓     1.15
SwapA$_{15}$           Iter (15)   ✓     3.04
SwapAViolation$_5$     Iter (5)    ✗     2.35
SwapAViolation$_{15}$  Iter (15)   ✗     4.21


Using HySO we verify a selection of such trace properties that often require
non-trivial reasoning by coming up with a suitable invariant. We depict the
results in Table 2b. In our preliminary experiments, we model a situation where
we start with $\{a\}^1\{\}^\omega$ and can swap letters $\{a\}$ and $\{\}$. We
then, e.g., ask if on any trace in the resulting Mazurkiewicz trace, $a$ holds
at most once, which requires inductive invariants and cannot be established by
iteration.
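
For finite words, the Mazurkiewicz trace of a word can be computed directly as the closure under single swaps of adjacent independent letters. The sketch below is a finite-word illustration only (the setting above concerns infinite traces); `mazurkiewicz_class` is a hypothetical helper, and letters are single characters.

```python
from collections import deque


def mazurkiewicz_class(word, independent):
    """All words reachable from `word` by repeatedly swapping adjacent
    letters (x, y) with (x, y) in the independence relation."""
    seen = {word}
    queue = deque([word])
    while queue:
        current = queue.popleft()
        for i in range(len(current) - 1):
            if (current[i], current[i + 1]) in independent:
                swapped = (current[:i] + current[i + 1] + current[i]
                           + current[i + 2:])
                if swapped not in seen:
                    seen.add(swapped)
                    queue.append(swapped)
    return seen


independence = {("a", "b"), ("b", "a")}   # a and b commute, c commutes with nothing
cls = mazurkiewicz_class("aab", independence)
```

Each breadth-first step corresponds to one application of the "equal up to one swap" closure condition; a property such as "the letter a occurs at most once" can then be checked on every member of the class.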


7 Related Work 


In recent years, many logics for the formal specification of hyperproperties
have been developed, extending temporal logics with explicit path quantification
(examples include HyperLTL, HyperCTL* [20], HyperQPTL [10,45], HyperPDL
[38], and HyperATL* [5,9]); or extending first and second-order logics with an
equal level predicate [25,33]. Others study ($\omega$)-regular [14,37] and
context-free hyperproperties [35]; or discuss hyperproperties over data and
modulo theories [24,31]. Hyper$^2$LTL is the first temporal logic that reasons
about second-order hyperproperties, which allows us to capture many existing
(epistemic, asynchronous, etc.) hyperlogics while at the same time taking
advantage of model-checking solutions that have been proven successful in
first-order settings.


Asynchronous Hyperproperties. For asynchronous hyperproperties, Gutsfeld et
al. [39] present an asynchronous extension of the polyadic $\mu$-calculus.
Bozzelli et al. [17] extend HyperLTL with temporal operators that are only
evaluated if the truth value of some temporal formula changes. Baumeister et al.
present AHLTL [3], which extends HyperLTL with an explicit quantification over
trajectories and can be directly encoded within Hyper$^2$LTL$_{fp}$.

Regular Model Checking. Regular model checking [15] is a general verification
method for (possibly infinite state) systems, in which each state of the system
is interpreted as a finite word. The transitions of the system are given as a
finite-state (regular) transducer, and the model checking problem asks if, from
some initial set of states (given as a regular language), some bad state is
eventually reachable. Many methods for automated regular model checking have
been developed [12,13,19,26]. Hyper$^2$LTL can be seen as a logical foundation
for $\omega$-regular model checking: Assume the set of initial states is given
as a QPTL formula $\varphi_{init}$, the set of bad states is given as a QPTL
formula $\varphi_{bad}$, and the transition relation is given as a QPTL formula
$\varphi_{step}$ over trace variables $\pi$ and $\pi'$. The set of bad states is
not reachable from a trace (state) in $\varphi_{init}$ iff the following
Hyper$^2$LTL$_{fp}$ formula holds on the system that generates all traces:

$$\big(X, \curlyvee,\ \forall \pi \in \mathfrak{S}.\ \varphi_{init}(\pi) \rightarrow \pi \triangleright X \;\land\; \forall \pi \in X.\ \forall \pi' \in \mathfrak{S}.\ \varphi_{step}(\pi, \pi') \rightarrow \pi' \triangleright X\big).\ \forall \pi \in X.\ \neg \varphi_{bad}(\pi)$$

Conversely, Hyper$^2$LTL$_{fp}$ can express more complex properties, beyond the
reachability checks possible in the framework of ($\omega$-)regular model checking.


Model Checking Knowledge. Model checking of knowledge properties in
multi-agent systems was developed in the tools MCK [36] and MCMAS [42], which
can exactly express LTL$_K$. Bozzelli et al. [16] have shown that HyperCTL* and
LTL$_K$ have incomparable expressiveness, and present HyperCTL*$_{lp}$, an
extension of HyperCTL* that can reason about the past, to unify HyperCTL* and
LTL$_K$. While HyperCTL*$_{lp}$ can express the knowledge operator, it cannot
capture common knowledge. LTL$_{K,C}$ [41] captures both knowledge and common
knowledge, but the suggested model-checking algorithm only handles a decidable
fragment that is reducible to LTL model checking.


8 Conclusion 


Hyperproperties play an increasingly important role in many areas of computer 
science. There is a strong need for specification languages and verification meth- 
ods that reason about hyperproperties in a uniform and general manner, similar 
to what is standard for more traditional notions of safety and reliability. In 
this paper, we have ventured forward from the first-order reasoning of logics 
like HyperLTL into the realm of second-order hyperproperties, i.e., properties 
that not only compare individual traces but reason comprehensively about sets 
of such traces. With Hyper$^2$LTL, we have introduced a natural specification
language and a general model-checking approach for second-order
hyperproperties. Hyper$^2$LTL provides a general framework for a wide range of
relevant hyperproperties, including common knowledge and asynchronous
hyperproperties, which could previously only be studied with specialized logics
and algorithms. Hyper$^2$LTL also provides a starting point for future work on
second-order hyperproperties in areas such as cyber-physical [44] and
probabilistic systems [28].

Abstract. We propose a method for certifying the fairness of the
classification result of a widely used supervised learning algorithm, the
k-nearest neighbors (KNN), under the assumption that the training data
may have historical bias caused by systematic mislabeling of samples
from a protected minority group. To the best of our knowledge, this is
the first certification method for KNN based on three variants of the
fairness definition: individual fairness, $\epsilon$-fairness, and
label-flipping fairness. We first define the fairness certification problem for KNN and then
propose sound approximations of the complex arithmetic computations 
used in the state-of-the-art KNN algorithm. This is meant to lift the 
computation results from the concrete domain to an abstract domain, 
to reduce the computational cost. We show effectiveness of this abstract 
interpretation based technique through experimental evaluation on six 
datasets widely used in the fairness research literature. We also show 
that the method is accurate enough to obtain fairness certifications for 
a large number of test inputs, despite the presence of historical bias in 
the datasets. 


1 Introduction 


Certifying the fairness of the classification output of a machine learning model 
has become an important problem. This is in part due to a growing interest in 
using machine learning techniques to make socially sensitive decisions in areas 
such as education, healthcare, finance, and criminal justice systems. One rea- 
son why the classification output may be biased against an individual from a 
protected minority group is because the dataset used to train the model may 
have historical bias; that is, there is systematic mislabeling of samples from the 
protected minority group. Thus, we must be extremely careful while considering 
the possibility of using the classification output of a machine learning model, to 
avoid perpetuating or even amplifying historical bias. 

One solution to this problem is to have the ability to certify that the
classification output y = M(x) for an individual input x is fair, even though
the model M is learned from a dataset T with historical bias. This is a form of
individual fairness that has been studied in the fairness literature [14];
it requires that the classification output remains the same for input x even if
historical bias were not in the training dataset T. However, this is a
challenging problem and, to the best of our knowledge, techniques for solving it
efficiently are still severely lacking. Our work aims to fill the gap.

Fig. 1. FAIRKNN: our method for certifying fairness of KNNs with label bias.

Specifically, we are concerned with three variants of the fairness definition.
Let the input $x = (x_1, \ldots, x_D)$ be a D-dimensional input vector, and P be
the subset of vector indices corresponding to the protected attributes (e.g.,
race, gender, etc.). The first variant of the fairness definition is individual
fairness, which requires that similar individuals are treated similarly by the
machine learning model. For example, if two individual inputs $x$ and $x'$
differ only in some protected attribute $x_i$, where $i \in P$, but agree on all
the other attributes, the classification output must be the same. The second
variant is $\epsilon$-fairness, which extends the notion of individual fairness
to include inputs whose unprotected attributes differ and yet the difference is
bounded by a small constant ($\epsilon$). In other words, if two individual
inputs are almost the same in all unprotected attributes, they should also have
the same classification output. The third variant is label-flipping fairness,
which requires the aforementioned fairness requirements to be satisfied even if
a biased dataset T has been used to train the model in the first place. That is,
as long as the number of mislabeled elements in T is bounded by n, the
classification output must be the same.

We want to certify the fairness of the classification output for a popular
supervised learning technique called the k-nearest neighbors (KNN) algorithm.
Our interest in KNN comes from the fact that, unlike many other machine
learning techniques, KNN is a model-less technique and thus does not have
the high cost associated with training the model. For this reason, KNN
has been widely adopted in real-world applications [1,4,16,18,23,29,36,45,46].
However, obtaining a fairness certification for KNN is still challenging and, in
practice, the most straightforward approach of enumerating all possible scenarios
and then checking if the classification outputs obtained in these scenarios agree
would be prohibitively expensive.

To overcome the challenge, we propose an efficient method based on the idea
of abstract interpretation [10]. Our method relies on sound approximations to
analyze the arithmetic computations used by the state-of-the-art KNN algorithm
both accurately and efficiently. Figure 1 gives an overview: the lower half
shows our method, which conducts the analysis in an abstract domain, and the
upper half shows the default KNN algorithm, which operates in the concrete
domain. The main difference is that, by staying in the abstract domain, our
method is able to analyze a large set of possible training datasets (derived from
T due to n label-flips) and a potentially infinite set of inputs (derived from x
due to $\epsilon$ perturbation) symbolically, as opposed to analyzing a single
training dataset and a single input concretely.

To the best of our knowledge, this is the first method for KNN fairness 
certification in the presence of dataset bias. While Meyer et al. [26,27] and 
Drews et al. [12] have investigated robustness certification techniques, their 
methods target decision trees and linear regression, which are different types 
of machine learning models from KNN. Our method also differs from the KNN 
data-poisoning robustness verification techniques developed by Jia et al. [20] and 
Li et al. [24], which do not focus on fairness at all; for example, they do not 
distinguish protected attributes from unprotected attributes. Furthermore, Jia et 
al. [20] consider the prediction step only while ignoring the learning step, and 
Li et al. [24] do not consider label flipping. Our method, in contrast, considers 
all of these cases. 

We have implemented our method and demonstrated its effectiveness
through experimental evaluation. We used six datasets that are popular in
the fairness research literature as benchmarks. Our evaluation results show that
the proposed method is efficient in analyzing complex arithmetic computations 
used in the state-of-the-art KNN algorithm, and is accurate enough to obtain 
fairness certifications for a large number of test inputs. To better understand 
the impact of historical bias, we also compared the fairness certification success 
rates across different demographic groups. 

To summarize, this paper makes the following contributions: 


— We propose an abstract interpretation based method for efficiently certifying 
the fairness of KNN classification results in the presence of dataset bias. The 
method relies on sound approximations to speed up the analysis of both the 
learning and the prediction steps of the state-of-the-art KNN algorithm, and 
is able to handle three variants of the fairness definition. 

— We implement the method and evaluate it on six datasets that are widely 
used in the fairness literature, to demonstrate the efficiency of our approx- 
imation techniques as well as the effectiveness of our method in obtaining 
sound fairness certifications for a large number of test inputs. 


The remainder of this paper is organized as follows. We first present the
technical background in Sect. 2 and then give an overview of our method in
Sect. 3. Next, we present our detailed algorithms for certifying the KNN
prediction step in Sect. 4 and certifying the KNN learning step in Sect. 5.
This is followed by our experimental results in Sect. 6. We review the related
work in Sect. 7 and, finally, give our conclusion in Sect. 8.


2 Background


Let L be a supervised learning algorithm that takes the training dataset T
as input and returns a learned model M = L(T) as output. The training set
$T = \{(x, y)\}$ is a set of labeled samples, where each
$x \in \mathcal{X} \subseteq \mathbb{R}^D$ has D real-valued attributes, and
$y \in \mathcal{Y} \subseteq \mathbb{N}$ is a class label. The learned model
$M : \mathcal{X} \to \mathcal{Y}$ is a function that returns the classification
output $y' \in \mathcal{Y}$ for any input $x' \in \mathcal{X}$.


2.1 Fairness of the Learned Model 


We are concerned with fairness of the classification output M(x) for an
individual input x. Let P be the set of vector indices corresponding to the
protected attributes in $x \in \mathcal{X}$. We say that $x_i$ is a protected
attribute (e.g., race, gender, etc.) if and only if $i \in P$.


Definition 1 (Individual Fairness). For an input $x$, the classification output
$M(x)$ is fair if, for any input $x'$ such that (1) $x_j \neq x'_j$ for some
$j \in P$ and (2) $x_i = x'_i$ for all $i \notin P$, we have $M(x) = M(x')$.


This means that two individuals ($x$ and $x'$) differing only in some protected
attribute (e.g., gender) but agreeing on all other attributes must be treated
equally. While intuitive and useful, this notion of fairness may be too narrow.
For example, if two individuals differ in some unprotected attributes and yet
the difference is considered immaterial, they must still be treated equally.
This can be captured by $\epsilon$-fairness.


Definition 2 ($\epsilon$-Fairness). For an input $x$, the classification output
$M(x)$ is fair if, for any input $x'$ such that (1) $x_j \neq x'_j$ for some
$j \in P$ and (2) $|x_i - x'_i| \le \epsilon$ for all $i \notin P$, we have
$M(x) = M(x')$.


In this case, such inputs $x'$ form a set. Let $\Delta^\epsilon(x)$ be the set
of all inputs $x'$ considered in the $\epsilon$-fairness definition. That is,
$\Delta^\epsilon(x) := \{x' \mid x_j \neq x'_j \text{ for some } j \in P,\ |x_i - x'_i| \le \epsilon \text{ for all } i \notin P\}$.
By requiring $M(x) = M(x')$ for all $x' \in \Delta^\epsilon(x)$,
$\epsilon$-fairness guarantees that a larger set of individuals similar to $x$
are treated equally.

Individual fairness can be viewed as a special case of $\epsilon$-fairness,
where $\epsilon = 0$. In contrast, when $\epsilon > 0$, the number of elements in
$\Delta^\epsilon(x)$ is often large and sometimes infinite. Therefore, the most
straightforward approach of certifying fairness by enumerating all possible
elements in $\Delta^\epsilon(x)$ would not work. Instead, any practical solution
would have to rely on abstraction.
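
Membership in the set of inputs considered by Definitions 1 and 2 can be written down directly. The following is a minimal sketch; the function name `in_delta` and the tuple encoding of inputs are assumptions made for illustration, and the comparison uses a non-strict bound so that setting the tolerance to zero yields individual fairness.

```python
def in_delta(x, x_prime, protected, eps):
    """Return True iff x_prime belongs to the set of inputs compared with x.

    x, x_prime: equal-length sequences of real-valued attributes.
    protected:  set of indices of protected attributes (the set P).
    eps:        tolerance on unprotected attributes (0 for individual fairness).
    """
    differs_on_protected = any(x[j] != x_prime[j] for j in protected)
    close_elsewhere = all(
        abs(x[i] - x_prime[i]) <= eps
        for i in range(len(x)) if i not in protected
    )
    return differs_on_protected and close_elsewhere


x = (1.0, 0.5, 30.0)          # hypothetical input; attribute 0 is protected
P = {0}
```

With `eps = 0` only inputs that differ from `x` in the protected attribute alone are admitted; with `eps > 0` there are infinitely many admissible real-valued inputs, which is why enumeration is not an option.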


2.2 Fairness in the Presence of Dataset Bias 


Due to historical bias, the training dataset $T$ may contain samples whose
outputs are unfairly labeled. Let the number of such samples be bounded by $n$.
We assume that there are no additional clues available to help identify the
mislabeled samples. Without knowing which samples these are, fairness certification
must consider all of the possible scenarios. Each scenario corresponds to a
de-biased dataset, $T'$, constructed by flipping back the incorrect labels in $T$. Let
$\mathrm{dBias}_n(T) = \{T'\}$ be the set of these possible de-biased (clean) datasets. Ideally,
we want all of them to lead to the same classification output.


Definition 3 (Label-flipping Fairness). For an input $x$, the classification
output $M(x)$ is fair against label-flipping bias of at most $n$ elements in the dataset
$T$ if, for all $T' \in \mathrm{dBias}_n(T)$, we have $M'(x) = M(x)$, where $M' = L(T')$.


Label-flipping fairness differs from, and yet complements, individual and
$\epsilon$-fairness in the following sense. While individual and $\epsilon$-fairness guarantee equal
output for similar inputs, label-flipping fairness guarantees equal output for
similar datasets. Both aspects of fairness are practically important. By combining
them, we are able to define the entire problem of certifying fairness in the
presence of historical bias.

To understand the complexity of the fairness certification problem, we need
to look at the size of the set $\mathrm{dBias}_n(T)$, similar to how we have analyzed the size
of $\Delta^\epsilon(x)$. While the size of $\mathrm{dBias}_n(T)$ is always finite, it can be astronomically
large in practice. Let $q$ be the number of unique class labels and $m$ be the actual
number of flipped elements in $T$. Assuming that each flipped label may take
any of the other $q - 1$ possible labels, the total number of possible clean sets is
$\binom{|T|}{m} \cdot (q - 1)^m$ for each $m$. Since $m \le n$, $|\mathrm{dBias}_n(T)| = \sum_{m=1}^{n} \binom{|T|}{m} \cdot (q - 1)^m$.
Again, the number of elements in $\mathrm{dBias}_n(T)$ is too large to enumerate, which
means any practical solution would have to rely on abstraction.
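
To get a feeling for how quickly this count grows, the following sketch evaluates the sum for given values of |T|, n and q. The function name `num_clean_datasets` is hypothetical, and the sketch assumes the sum runs over m = 1..n flipped labels (counting the unmodified T as well would add one).

```python
from math import comb

def num_clean_datasets(size_t: int, n: int, q: int) -> int:
    """Number of datasets obtained from a dataset of size_t samples by
    flipping between 1 and n labels, each to one of the q - 1 other labels."""
    return sum(comb(size_t, m) * (q - 1) ** m for m in range(1, n + 1))

# Illustrative values only: 1,000 samples, up to 10 flips, binary labels
# (q = 2 is an assumption made for this example).
print(num_clean_datasets(1000, 10, 2) > 10 ** 23)  # True
```

Even for a medium-sized dataset the count already exceeds 10^23, which is why enumeration is ruled out.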


3 Overview of Our Method 


Given the tuple $\langle T, P, n, \epsilon, x \rangle$, where $T$ is the training set, $P$ represents the
protected attributes, $n$ bounds the number of biased elements in $T$, and $\epsilon$ bounds
the perturbation of $x$, our method checks if the KNN classification output for $x$
is fair.


3.1 The KNN Algorithm 


Since our method relies on an abstract interpretation of the KNN algorithm, 
we first explain how the KNN algorithm operates in the concrete domain (this 
subsection), and then lift it to the abstract domain in the next subsection. 

As shown in Fig. 2, KNN has a prediction step, where KNN_predict computes
the output label for an input $x$ using $T$ and a given parameter $K$, and a learning
step, where KNN_learn computes the $K$ value from the training set $T$.

Unlike many other machine learning techniques, KNN does not have an
explicit model $M$; instead, $M$ can be regarded as the combination of $T$ and
$K$.

 1  func KNN_predict(T, K, x) {
 2      Let $T_x^K$ = the K nearest neighbors of x in T;
 3      Let $Freq(T_x^K)$ = the most frequent label in $T_x^K$;
 4      return $Freq(T_x^K)$;
 5  }
 6
 7  func KNN_learn(T) {
 8      for (each candidate k value) {  // conducting p-fold cross validation
 9          Let $\{G_i\}$ = a partition of T into p groups of roughly equal size;
10          Let $err_i^k = \{(x, y) \in G_i \mid y \neq$ KNN_predict$(T \setminus G_i, k, x)\}$ for each $G_i$;
11      }
12      Let $K = \arg\min_k \frac{1}{p} \sum_{i=1}^{p} |err_i^k| / |G_i|$;
13      return K;
14  }


Fig. 2. The KNN algorithm, consisting of the prediction and learning steps.


Inside KNN_predict, the set $T_x^K$ represents the K nearest neighbors of $x$ in
the dataset $T$, where distance is measured by Euclidean (or Manhattan) distance
in the input vector space. $Freq(T_x^K)$ is the most frequent label in $T_x^K$.

Inside KNN_learn, a technique called p-fold cross validation is used to select
the optimal value for $K$, e.g., from a set of candidate $k$ values in the range
$[1, |T| \times (p-1)/p]$, by minimizing classification error, as shown in Line 12. This is
accomplished by first partitioning $T$ into $p$ groups of roughly equal size (Line 9),
and then computing $err_i^k$ (the set of misclassified samples from $G_i$) by treating $G_i$
as the evaluation set and $T \setminus G_i$ as the training set. Here, an input $(x, y) \in G_i$
is “misclassified” if the expected output label, $y$, differs from the output of
KNN_predict using the candidate $k$ value.
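
The two steps can be sketched in plain Python as follows. This is only an illustrative sketch, not the paper's implementation: the names `knn_predict` and `knn_learn`, the round-robin partition into folds, the explicit `candidates` argument and the tie-breaking rule (nearest label wins) are all assumptions.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (attribute vector, class label)

def knn_predict(T: List[Sample], K: int, x: Sequence[float]) -> int:
    """Return the most frequent label among the K nearest neighbors of x."""
    neighbors = sorted(T, key=lambda s: math.dist(s[0], x))[:K]
    # most_common keeps first-seen order among equal counts, so ties are
    # resolved in favor of the label of the nearer neighbor.
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

def knn_learn(T: List[Sample], candidates: Sequence[int], p: int) -> int:
    """Pick the candidate k with the smallest average p-fold error rate."""
    groups = [T[i::p] for i in range(p)]  # simple round-robin partition

    def avg_error(k: int) -> float:
        total = 0.0
        for i, G in enumerate(groups):
            train = [s for j, g in enumerate(groups) if j != i for s in g]
            errors = sum(1 for x, y in G if knn_predict(train, k, x) != y)
            total += errors / len(G)
        return total / p

    return min(candidates, key=avg_error)
```

The inner function mirrors Line 12 of Fig. 2: the error rate of each fold is normalized by the fold size and then averaged over the p folds.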


3.2 Certifying the KNN Algorithm 


Algorithm 1 shows the top-level procedure of our fairness certification method, 
which first executes the KNN algorithm in the concrete domain (Lines 1-2), to 
obtain the default K and y, and then starts our analysis in the abstract domain. 


Algorithm 1: Our method for certifying fairness of KNN for input $x$.

1  K = KNN_learn(T);
2  y = KNN_predict(T, K, x);
3  KSet = abs_KNN_learn(T, n);
4  for each $K \in KSet$ do
5      if abs_KNN_predict_same(T, n, K, x, y) = False then
6          return unknown;
7      end if
8  end for
9  return certified;

In the abstract learning step (Line 3), instead of considering $T$, our method
considers the set of all clean datasets in $\mathrm{dBias}_n(T)$ symbolically, to compute the
set of possible optimal $K$ values, denoted $KSet$.

In the abstract prediction step (Lines 4-8), for each $K$, instead of considering
input $x$, our method considers all perturbed inputs in $\Delta^\epsilon(x)$ and all clean
datasets in $\mathrm{dBias}_n(T)$ symbolically, to check if the classification output always
stays the same. Our method returns “certified” only when the classification
output always stays the same (Line 9); otherwise, it returns “unknown” (Line 6).
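
The control flow of the top-level procedure can be sketched as a small Python driver. The concrete and abstract subroutines are passed in as callables, because their implementations are described separately; the function name `certify` and this callable-based interface are assumptions of the sketch, not the tool's actual API.

```python
from typing import Any, Callable, Iterable

def certify(T: Any, n: int, x: Any,
            knn_learn: Callable[[Any], int],
            knn_predict: Callable[[Any, int, Any], Any],
            abs_knn_learn: Callable[[Any, int], Iterable[int]],
            abs_knn_predict_same: Callable[[Any, int, int, Any, Any], bool]) -> str:
    """Return 'certified' only if every possibly-optimal K keeps the label y."""
    K = knn_learn(T)               # concrete learning step
    y = knn_predict(T, K, x)       # concrete prediction: the default label
    for k in abs_knn_learn(T, n):  # over-approximated set of optimal K values
        if not abs_knn_predict_same(T, n, k, x, y):
            return "unknown"       # a different label could not be ruled out
    return "certified"
```

Note that "unknown" does not mean the output is unfair; it only means the abstract analysis could not prove fairness.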

We only perturb numerical attributes in the input x since perturbing cate- 
gorical or binary attributes often does not make sense in practice. 

In the next two sections, we present our detailed algorithms for abstracting 
the prediction step and the learning step, respectively. 


4 Abstracting the KNN Prediction Step 


We start with abstract KNN prediction, which is captured by the subroutine
abs_KNN_predict_same used in Line 5 of Algorithm 1. It consists of two parts.
The first part (to be presented in Sect. 4.1) computes a superset of $T_x^K$, denoted
$overNN$, while considering the impact of $\epsilon$ perturbation of the input $x$. The
second part (to be presented in Sect. 4.2) leverages $overNN$ to decide if the
classification output always stays the same, while considering the impact of
label-flipping bias in the dataset $T$.


4.1 Finding the K-Nearest Neighbors 


To compute $overNN$, which is a set of samples in $T$ that may be the K nearest
neighbors of the test input $x$, we must be able to compute the distance between
$x$ and each sample in $T$.

This is not a problem at all in the concrete domain, since the K nearest
neighbors of $x$ in $T$, denoted $T_x^K$, are fixed and determined solely by the Euclidean
distance between $x$ and each sample in $T$ in the attribute space. However, when $\epsilon$
perturbation is applied to $x$, the distance changes and, as a result, the K nearest
neighbors of $x$ may also change.

Fortunately, the distance in the attribute space is not affected by
label-flipping bias in the dataset $T$, since label-flipping only impacts sample labels,
not sample attributes. Thus, in this subsection, we only need to consider the
impact of $\epsilon$ perturbation of the input $x$.


The Challenge. Due to $\epsilon$ perturbation, a single test input $x$ becomes a
potentially infinite set of inputs $\Delta^\epsilon(x)$. Since our goal is to over-approximate
the K nearest neighbors of $\Delta^\epsilon(x)$, the expectation is that, as long as there exists
some $x' \in \Delta^\epsilon(x)$ such that a sample input $t$ in $T$ is one of the K nearest neighbors
of $x'$, denoted $t \in T_{x'}^K$, we must include $t$ in the set $overNN$. That is,

$$\bigcup_{x' \in \Delta^\epsilon(x)} T_{x'}^K \subseteq overNN \subseteq T.$$

However, finding an efficient way of computing $overNN$ is a challenging task. As
explained before, the naive approach of enumerating $x' \in \Delta^\epsilon(x)$, computing the
K nearest neighbors, $T_{x'}^K$, and taking the union of all of them would not work. Instead,
we need an abstraction that is both efficient and accurate enough in practice.

Our solution is that, for each sample $t$ in $T$, we first analyze the distances
between $t$ and all inputs in $\Delta^\epsilon(x)$ symbolically, to compute a lower bound and an
upper bound of the distances. Then, we leverage these lower and upper bounds to
compute the set $overNN$, which is a superset of the samples in $T$ that may become
the K nearest neighbors of $\Delta^\epsilon(x)$.

Bounding Distance Between $\Delta^\epsilon(x)$ and $t$. Assume that $x = (x_1, x_2, \ldots, x_D)$
and $t = (t_1, t_2, \ldots, t_D)$ are two real-valued vectors in the D-dimensional attribute
space. Let $\epsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_D)$, where $\epsilon_i \ge 0$, be the small perturbation. Thus, the
perturbed input is $x' = (x'_1, x'_2, \ldots, x'_D) = (x_1 + \delta_1, x_2 + \delta_2, \ldots, x_D + \delta_D)$, where
$\delta_i \in [-\epsilon_i, \epsilon_i]$ for all $i = 1, \ldots, D$.

The distance between $x$ and $t$ is a fixed value $d(x, t) = \sqrt{\sum_{i=1}^{D} (x_i - t_i)^2}$, since
both $x$ and the samples $t$ in $T$ are fixed, but the distance between $x' \in \Delta^\epsilon(x)$ and
$t$ is a function of $\delta_i \in [-\epsilon_i, \epsilon_i]$, since
$\sqrt{\sum_{i=1}^{D} (x'_i - t_i)^2} = \sqrt{\sum_{i=1}^{D} (x_i - t_i + \delta_i)^2}$.

For ease of presentation, we define the distance as $d^\epsilon = \sqrt{\sum_{i=1}^{D} d_i^\epsilon}$, where
$d_i^\epsilon = (x_i - t_i + \delta_i)^2$ is the (squared) distance function in the i-th dimension. Then,
our goal becomes computing the lower bound, $LB(d^\epsilon)$, and the upper bound,
$UB(d^\epsilon)$, in the domain $\delta_i \in [-\epsilon_i, \epsilon_i]$ for all $i = 1, \ldots, D$.


Distance Bounds Are Compositional. Our first observation is that bounds
on the distance $d^\epsilon$ as a whole can be computed using bounds in the individual
dimensions. To see why this is the case, consider the (squared) distance in the i-th
dimension, $d_i^\epsilon = (x_i - t_i + \delta_i)^2$, where $\delta_i \in [-\epsilon_i, \epsilon_i]$, and the (squared) distance in
the j-th dimension, $d_j^\epsilon = (x_j - t_j + \delta_j)^2$, where $\delta_j \in [-\epsilon_j, \epsilon_j]$. By definition, $d_i^\epsilon$
is completely independent of $d_j^\epsilon$ when $i \neq j$.

Thus, the lower bound of $d^\epsilon$, denoted $LB(d^\epsilon)$, can be calculated by finding
the lower bound of each $d_i^\epsilon$ in the i-th dimension. Similarly, the upper bound of
$d^\epsilon$, denoted $UB(d^\epsilon)$, can be calculated by finding the upper bound of each
$d_i^\epsilon$ in the i-th dimension. That is,

$$LB(d^\epsilon) = \sqrt{\sum_{i=1}^{D} LB(d_i^\epsilon)} \quad \text{and} \quad UB(d^\epsilon) = \sqrt{\sum_{i=1}^{D} UB(d_i^\epsilon)}.$$


Four Cases in Each Dimension. Our second observation is that, by utilizing
the mathematical nature of the (squared) distance function, we can calculate the
minimum and maximum values of $d_i^\epsilon$, which can then be used as the lower bound
$LB(d_i^\epsilon)$ and upper bound $UB(d_i^\epsilon)$, respectively.

Specifically, in the i-th dimension, the (squared) distance function
$d_i^\epsilon = ((x_i - t_i) + \delta_i)^2$ may be rewritten as $(\delta_i + A)^2$, where $A = (x_i - t_i)$ is a constant and
$\delta_i \in [-\epsilon_i, +\epsilon_i]$ is a variable. The function can be plotted in two-dimensional space,
using $\delta_i$ as the x-axis and the output of the function as the y-axis; thus, it is a quadratic
function $Y = (X + A)^2$.

Fig. 3. Four cases for computing the upper and lower bounds of the distance function
$d_i^\epsilon(\delta_i) = (\delta_i + A)^2$ for $\delta_i \in [-\epsilon_i, \epsilon_i]$. In these figures, $\delta_i$ is the x-axis, $d_i^\epsilon$ is the y-axis,
LB denotes $LB(d_i^\epsilon)$, and UB denotes $UB(d_i^\epsilon)$.


Figure 3 shows the plots, which remind us where the minimum and maximum
values of a quadratic function are. There are two versions of the quadratic
function, depending on whether $A > 0$ (corresponding to the two subfigures at
the top) or $A < 0$ (corresponding to the two subfigures at the bottom). Each
version also has two cases, depending on whether the perturbation interval $[-\epsilon_i, \epsilon_i]$
falls inside the constant interval $[-|A|, |A|]$ (corresponding to the two subfigures
on the left) or falls outside (corresponding to the two subfigures on the right).
Thus, there are four cases in total.

In each case, the maximal and minimal values of the quadratic function are
different, as shown by the LB and UB marks in Fig. 3.


Case (a). This is when $(x_i - t_i) > 0$ and $-\epsilon_i > -(x_i - t_i)$, which is the same
as saying $A > 0$ and $-\epsilon_i > -A$. In this case, the function $d_i^\epsilon(\delta_i) = (\delta_i + A)^2$ is
monotonically increasing w.r.t. the variable $\delta_i \in [-\epsilon_i, +\epsilon_i]$.

Thus, $LB(d_i^\epsilon) = (-\epsilon_i + (x_i - t_i))^2$ and $UB(d_i^\epsilon) = (+\epsilon_i + (x_i - t_i))^2$.


Case (b). This is when $(x_i - t_i) > 0$ and $-\epsilon_i < -(x_i - t_i)$, which is the same
as saying $A > 0$ and $-\epsilon_i < -A$. In this case, the function is not monotonic.
The minimal value is 0, obtained when $\delta_i = -A$. The maximal value is obtained
when $\delta_i = +\epsilon_i$.

Thus, $LB(d_i^\epsilon) = 0$ and $UB(d_i^\epsilon) = (+\epsilon_i + (x_i - t_i))^2$.


Case (c). This is when $(x_i - t_i) < 0$ and $\epsilon_i < -(x_i - t_i)$, which is the same as
saying $A < 0$ and $\epsilon_i < -A$. In this case, the function is monotonically decreasing
w.r.t. the variable $\delta_i \in [-\epsilon_i, \epsilon_i]$.

Thus, $LB(d_i^\epsilon) = (+\epsilon_i + (x_i - t_i))^2$ and $UB(d_i^\epsilon) = (-\epsilon_i + (x_i - t_i))^2$.


Case (d). This is when $(x_i - t_i) < 0$ and $\epsilon_i > -(x_i - t_i)$, which is the same
as saying $A < 0$ and $\epsilon_i > -A$. In this case, the function is not monotonic. The
minimal value is 0, obtained when $\delta_i = -A$. The maximal value is obtained
when $\delta_i = -\epsilon_i$.

Thus, $LB(d_i^\epsilon) = 0$ and $UB(d_i^\epsilon) = (-\epsilon_i + (x_i - t_i))^2$.


Summary. By combining the above four cases, we compute the bounds of the
entire distance function $d^\epsilon$ as follows:

$$\left[\ \sqrt{\sum_{i=1}^{D} \max(|x_i - t_i| - \epsilon_i,\, 0)^2},\ \ \sqrt{\sum_{i=1}^{D} (|x_i - t_i| + \epsilon_i)^2}\ \right]$$

Here, the take-away message is that, since $x_i$, $t_i$ and $\epsilon_i$ are all fixed values, the
upper and lower bounds can be computed in constant time, even though there
is a potentially infinite number of inputs in $\Delta^\epsilon(x)$.
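
The closed-form bounds translate directly into a few lines of Python. The sketch below assumes Euclidean distance and a per-dimension perturbation vector; the function name `distance_bounds` is hypothetical.

```python
import math
from typing import Sequence, Tuple

def distance_bounds(x: Sequence[float], t: Sequence[float],
                    eps: Sequence[float]) -> Tuple[float, float]:
    """Lower and upper bounds of the Euclidean distance between t and any
    perturbed input x' with |x'_i - x_i| <= eps_i in every dimension i."""
    # Per dimension, the smallest squared gap is max(|x_i - t_i| - eps_i, 0)^2
    # and the largest squared gap is (|x_i - t_i| + eps_i)^2.
    lb_sq = sum(max(abs(xi - ti) - ei, 0.0) ** 2 for xi, ti, ei in zip(x, t, eps))
    ub_sq = sum((abs(xi - ti) + ei) ** 2 for xi, ti, ei in zip(x, t, eps))
    return math.sqrt(lb_sq), math.sqrt(ub_sq)
```

The two `max`/`abs` expressions cover all four cases at once: the sign of $x_i - t_i$ is absorbed by the absolute value, and the non-monotonic cases are those where the lower bound is clamped to zero.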

Computing overNN Using Bounds. With the upper and lower bounds of the
distance between $\Delta^\epsilon(x)$ and sample $t$ in the dataset $T$, denoted
$[LB(d^\epsilon(x,t)), UB(d^\epsilon(x,t))]$, we are ready to compute $overNN$ such that every $t \in overNN$
may be among the K nearest neighbors of $\Delta^\epsilon(x)$.

Let $UB_{Kmin}$ denote the K-th minimum value of $UB(d^\epsilon(x,t))$ for all $t \in T$.
Then, we define $overNN$ as the set of samples in $T$ whose $LB(d^\epsilon(x,t))$ is not
greater than $UB_{Kmin}$. In other words,

$$overNN = \{t \in T \mid LB(d^\epsilon(x,t)) \le UB_{Kmin}\}.$$
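
A minimal sketch of this selection step, assuming the per-sample bounds have already been computed and are given as a list of (lower, upper) pairs; the function name `over_nn` and the index-based return value are choices made for illustration. The sample bounds are the ones used in the worked example for K = 3.

```python
from typing import List, Tuple

def over_nn(bounds: List[Tuple[float, float]], K: int) -> List[int]:
    """Indices of the samples whose lower bound does not exceed the K-th
    smallest upper bound, i.e. the over-approximated K nearest neighbors."""
    ub_kmin = sorted(ub for _, ub in bounds)[K - 1]
    return [i for i, (lb, _) in enumerate(bounds) if lb <= ub_kmin]

bounds = [(25.4, 29.4), (30.1, 34.1), (35.3, 39.3), (37.2, 41.2), (85.5, 90.5)]
print(over_nn(bounds, 3))  # [0, 1, 2, 3]
```

Only the last sample is excluded, because its lower bound (85.5) exceeds the third smallest upper bound (39.3).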


Example. Consider a dataset $T = \{t^1, t^2, t^3, t^4, t^5\}$, a test input $x$,
perturbation $\epsilon$, and $K = 3$. Assume that the lower and upper bounds of the
distance between $\Delta^\epsilon(x)$ and the samples in $T$ are [25.4, 29.4], [30.1, 34.1], [35.3, 39.3],
[37.2, 41.2], and [85.5, 90.5], respectively. Since $K = 3$, we find the 3rd minimum upper bound,
$UB_{3min} = 39.3$. By comparing $UB_{3min}$ with the lower bounds, we compute
$overNN_3 = \{t^1, t^2, t^3, t^4\}$, since $t^5$ is the only sample in $T$ whose lower bound
is greater than 39.3. All the other four samples may be among the 3 nearest
neighbors of $\Delta^\epsilon(x)$.

Due to $\epsilon$ perturbation, the set $overNN_3$ for $K = 3$ is expected to contain
3 or more samples. That is, since different inputs in $\Delta^\epsilon(x)$ may have different
samples as their 3 nearest neighbors, to be conservative, we have to take the
union of all possible sets of 3 nearest neighbors.

Algorithm 2: Subroutine abs_same_label($overNN$, $K$, $y$).

1  Let $S$ be a subset of $overNN$ obtained by removing all y-labeled elements;
2  Let $y' = Freq(S)$, and $\#y'$ be the count of $y'$-labeled elements in $S$;
3  if $\#y' < K - |S| - 2 \times n$ then
4      return True;
5  end if
6  return False;


Soundness Proof. Here we prove that any $t' \notin overNN$ cannot be among
the K nearest neighbors of any $x' \in \Delta^\epsilon(x)$. Since $UB_{Kmin}$ is the K-th
minimum $UB(d^\epsilon(x,t))$ for all $t \in T$, there must be samples $t^1, t^2, \ldots, t^K$ such that
$UB(d^\epsilon(x,t^i)) \le UB_{Kmin}$ for all $i = 1, 2, \ldots, K$. For any $t' \notin overNN$, we have
$LB(d^\epsilon(x,t')) > UB_{Kmin}$.

Combining the above conditions, we have $LB(d^\epsilon(x,t')) > UB(d^\epsilon(x,t^i))$ for
$i = 1, 2, \ldots, K$. This means that at least K other samples are closer to $x'$ than $t'$. Thus, $t'$
cannot be among the K nearest neighbors of $x'$.


4.2 Checking the Classification Result 


Next, we try to certify that, regardless of which K elements are selected
from $overNN$, the prediction result obtained using them is always the same.

The prediction label is affected by both $\epsilon$ perturbation of the input $x$ and
label-flipping bias in the dataset $T$. Since $\epsilon$ perturbation affects which points are
identified as the K nearest neighbors, and its impact has been accounted for by
$overNN$, from now on, we focus only on label-flipping bias in $T$.

Our method is shown in Algorithm 2, which takes the set $overNN$, the
parameter $K$, and the expected label $y$ as input, and checks if it is possible to
find a subset of $overNN$ with size $K$, whose most frequent label differs from
$y$. If such a “bad” subset cannot be found, we say that KNN prediction always
returns the same label.

To try to find such a “bad” subset of $overNN$, we first remove all elements
labeled with $y$ from $overNN$, to obtain the set $S$ (Line 1). After that, there are
two cases to consider.


1. If the size of $S$ is equal to or greater than $K$, then any subset of $S$ with
size $K$ must have a different label because it will not contain any element
labeled with $y$. Thus, the condition in Line 3 of Algorithm 2 is not satisfied
($\#y'$ is a positive number, and the right-hand side is non-positive), and the
procedure returns False.

2. If the size of $S$, denoted $|S|$, is smaller than $K$, the most likely “bad” subset
will be $S_K = S \cup \{\text{any } (K - |S|) \text{ y-labeled elements from } overNN\}$. In this
case, we need to check if the most frequent label in $S_K$ is $y$ or not.


In $S_K$, the most frequent label must be either $y$ (whose count is $K - |S|$) or
$y'$ (which is the most frequent label in $S$, with the count $\#y'$). Moreover, since
we can flip up to $n$ labels, we can flip $n$ elements from label $y$ to label $y'$.

Algorithm 3: Subroutine abs_KNN_learn($T$, $n$)

1  for each candidate k value do
2      Let $\{G_i\}$ = a partition of $T$ into $p$ groups of roughly equal size;
3      $errUB_i^k = \{(x,y) \in G_i \mid$ abs_may_err$(T \setminus G_i, n, k, x, y)$ = true$\}$ for each $G_i$;
4      $errLB_i^k = \{(x,y) \in G_i \mid$ abs_must_err$(T \setminus G_i, n, k, x, y)$ = true$\}$ for each $G_i$;
5      $UB_k = \frac{1}{p} \sum_{i=1}^{p} |errUB_i^k| / |G_i|$;
6      $LB_k = \frac{1}{p} \sum_{i=1}^{p} |errLB_i^k| / |G_i|$;
7  end for
8  Let $minUB$ = the minimum of $UB_k$ over all candidate k values;
9  return $KSet = \{k \mid LB_k \le minUB\}$;


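Therefore, to check if our method should return True, meaning the prediction
result is guaranteed to be the same as label $y$, we only need to compare $K - |S|$
with $\#y' + 2 \times n$: each flip from $y$ to $y'$ lowers the count of $y$ by one and raises
the count of $y'$ by one, so $n$ flips can close a gap of up to $2n$. This is checked
using the condition in Line 3 of Algorithm 2.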
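
The check can be sketched in Python as follows. The sketch works on the list of labels of the samples in $overNN$ and takes the flip budget `n` as an explicit argument (an assumption made here, since the check depends on it); the name `abs_same_label` follows the listing, while everything else is illustrative.

```python
from collections import Counter
from typing import Hashable, Sequence

def abs_same_label(over_nn_labels: Sequence[Hashable], K: int,
                   y: Hashable, n: int) -> bool:
    """True if no size-K subset of overNN can out-vote y, even after
    flipping up to n labels from y to the strongest competing label."""
    S = [label for label in over_nn_labels if label != y]
    # Count of the most frequent non-y label (0 if S is empty).
    count_y_prime = max(Counter(S).values(), default=0)
    return count_y_prime < K - len(S) - 2 * n
```

When `len(S) >= K`, the right-hand side is non-positive and the function returns False, which matches the first case discussed above.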


5 Abstracting the KNN Learning Step 


In this section, we present our method for abstracting the learning step, which
computes the optimal $K$ value based on $T$ and the impact of flipping at most $n$
labels. The output is a superset of possible optimal $K$ values, denoted $KSet$.

Algorithm 3 shows our method, which takes the training set $T$ and parameter
$n$ as input, and returns $KSet$ as output. To be sound, we require $KSet$ to
include any candidate $k$ value that may become the optimal $K$ for some clean
set $T' \in \mathrm{dBias}_n(T)$.

In Algorithm 3, our method first computes the lower and upper bounds of
the classification error for each $k$ value, denoted $LB_k$ and $UB_k$, as shown in
Lines 5-6. Next, it computes $minUB$, which is the minimal upper bound for all
candidate $k$ values (Line 8). Finally, by comparing $minUB$ with $LB_k$ for each
candidate $k$ value, our method decides whether this candidate $k$ value should be
put into $KSet$ (Line 9).

We will explain the steps needed to compute $LB_k$ and $UB_k$ in the remainder
of this section. For now, assuming that they are available, we explain how they
are used to compute $KSet$.


Example. Consider the candidate $k$ values $k_1, k_2, k_3, k_4$, with error bounds
[0.1, 0.2], [0.1, 0.3], [0.3, 0.4], and [0.3, 0.5], respectively. The smallest upper bound is $minUB = 0.2$.
By comparing $minUB$ with the lower bounds, we compute $KSet = \{k_1, k_2\}$,
since only $LB_{k_1}$ and $LB_{k_2}$ are lower than or equal to $minUB$.
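
The selection of candidate values from their error bounds can be sketched as below, assuming the bounds are already available as a mapping from candidate name to a (lower, upper) pair. The function name `k_set` and the string keys are illustrative only.

```python
from typing import Dict, List, Tuple

def k_set(bounds: Dict[str, Tuple[float, float]]) -> List[str]:
    """Keep every candidate whose error lower bound does not exceed the
    smallest error upper bound among all candidates."""
    min_ub = min(ub for _, ub in bounds.values())
    return [k for k, (lb, _) in bounds.items() if lb <= min_ub]

bounds = {"k1": (0.1, 0.2), "k2": (0.1, 0.3), "k3": (0.3, 0.4), "k4": (0.3, 0.5)}
print(k_set(bounds))  # ['k1', 'k2']
```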


Soundness Proof. Here we prove that any $k' \notin KSet$ cannot result in the smallest
classification error. Assume that $k_s$ is the candidate $k$ value that has the minimal
upper bound ($minUB$), and $err_{k_s}$ is its actual classification error. By definition,
we have $err_{k_s} \le minUB$. Meanwhile, for any $k' \notin KSet$, we have $LB_{k'} > minUB$,
and hence $err_{k'} \ge LB_{k'} > minUB$. Combining the two cases, we have
$err_{k'} > minUB \ge err_{k_s}$. Here, $err_{k'} > err_{k_s}$ means that $k'$ cannot result in the
smallest classification error.


Algorithm 4: Subroutine abs_may_err($T$, $n$, $K$, $x$, $y$).

1  Let $y'$ be, among the non-y labels, the label with the highest count in $T_x^K$;
2  Let $\#y$ be the number of elements in $T_x^K$ with the $y$ label;
3  Let $n'$ be $\min(n, \#y)$;
4  Change $n'$ elements in $T_x^K$ from the $y$ label to the $y'$ label;
5  return $Freq(T_x^K) \neq y$;


5.1 Overapproximating the Classification Error 


To compute the upper bound $errUB_i^k$ defined in Line 3 of Algorithm 3, we use
the subroutine abs_may_err to check if $(x,y) \in G_i$ may be misclassified when
using $T \setminus G_i$ as the training set.

Algorithm 4 shows the implementation of the subroutine, which checks, for
a sample $(x,y)$, whether it is possible to obtain a set $S$ by flipping at most $n$
labels in $T_x^K$ such that the most frequent label in $S$ is not $y$. If it is possible to
obtain such a set $S$, we conclude that the prediction label for $x$ may be an error.

The condition $Freq(T_x^K) \neq y$, computed on $T_x^K$ after the $y$ label of $n'$
elements is changed to the $y'$ label, is a sufficient condition under which the prediction
label for $x$ may be an error. The rationale is as follows.

In order to make the most frequent label in the set $T_x^K$ different from $y$, we
need to focus on the label most likely to become the new most frequent label. It
is the label $y'$ ($\neq y$) with the highest count in the current $T_x^K$.

Therefore, Algorithm 4 checks whether $y'$ can become the most frequent label
by changing at most $n$ elements in $T_x^K$ from the $y$ label to the $y'$ label (Lines 3-5).
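
A sketch of this check in Python, operating directly on the labels of the K nearest neighbors. Two choices here are assumptions of the sketch rather than statements from the paper: a tie between $y$ and $y'$ is conservatively counted as a possible error, and if no competing label is present among the neighbors its count is taken to be zero.

```python
from collections import Counter
from typing import Hashable, Sequence

def abs_may_err(neighbor_labels: Sequence[Hashable], n: int, y: Hashable) -> bool:
    """True if flipping at most n neighbor labels from y to the strongest
    competing label can stop y from being the strict majority label."""
    counts = Counter(neighbor_labels)
    count_y = counts.pop(y, 0)
    rival = max(counts.values(), default=0)  # count of y' before flipping
    flips = min(n, count_y)                  # n' = min(n, #y)
    return rival + flips >= count_y - flips
```

For example, with neighbor labels a, a, a, b, b and expected label a, no error is possible for n = 0, but a single flip (n = 1) already lets b overtake a.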


5.2 Underapproximating the Classification Error 


To compute the lower bound $errLB_i^k$ defined in Line 4 of Algorithm 3, we use
the subroutine abs_must_err to check if $(x,y) \in G_i$ must be misclassified when
using $T \setminus G_i$ as the training set.

Algorithm 5 shows the implementation of the subroutine, which checks, for
a sample $(x,y)$, whether it is impossible to obtain a set $S$ by flipping at most
$n$ labels in $T_x^K$ such that the most frequent label in $S$ is $y$. In other words, is
it impossible to avoid the classification error? If it is impossible to avoid the
classification error, we conclude that the prediction label must be an error, and
thus the procedure returns True.

In this sense, all samples in $errLB_i^k$ (computed in Line 4 of Algorithm 3) are
guaranteed to be misclassified.

Algorithm 5: Subroutine abs_must_err($T$, $n$, $K$, $x$, $y$).

1  if $\exists S$ obtained from $T_x^K$ by flipping up to $n$ labels such that $Freq(S) = y$ then
2      return False;
3  end if
4  return True;


The challenge in Algorithm 5 is to check if such a set $S$ can be constructed
from $T_x^K$. The intuition is that, to make $y$ the most frequent label, we should
flip the labels of non-y elements to label $y$. Let us consider two examples first.


Example 1. Consider the label counts of $T_x^K$, denoted $\{l_1 * 4, l_2 * 4, l_3 * 2\}$,
meaning that 4 elements are labeled $l_1$, 4 elements are labeled $l_2$, and 2 elements
are labeled $l_3$. Assume that $n = 2$ and $y = l_3$. Since we can flip at most 2
elements, we choose to flip one $l_1 \to l_3$ and one $l_2 \to l_3$, to get a set
$S = \{l_1 * 3, l_2 * 3, l_3 * 4\}$.


Example 2. Consider the label counts of $T_x^K$, denoted $\{l_1 * 5, l_2 * 3, l_3 * 2\}$, with $n = 2$
and $y = l_3$. We can flip two $l_1 \to l_3$ to get a set $S = \{l_1 * 3, l_2 * 3, l_3 * 4\}$.


The LP Problem. The question is how to decide whether the set $S$ (defined
in Line 1 of Algorithm 5) exists. We can formulate it as a linear programming
(LP) problem. The LP problem has two constraints. The first one is defined as
follows: Let $y$ be the expected label and $l_i \neq y$ be another label, where $i = 1, \ldots, q$
and $q$ is the total number of class labels (e.g., in the above two examples,
$q = 3$). Let $\#y$ be the number of elements in $T_x^K$ that have the $y$ label.
Similarly, let $\#l_i$ be the number of elements with the $l_i$ label. Assume that a set $S$
as defined in Algorithm 5 exists; then all of the labels $l_i \neq y$ must satisfy

$$\#l_i - \#flip_i < \#y + \sum_{i=1}^{q} \#flip_i, \qquad (1)$$

where $\#flip_i$ is a variable representing the number of $l_i$-to-$y$ flips. Thus, in the
above formula, the left-hand side is the count of $l_i$ after flipping, and the right-hand
side is the count of $y$ after flipping. Since $y$ is the most frequent label in $S$, $y$
should have a higher count than any other label.

The second constraint is

$$\sum_{i=1}^{q} \#flip_i \le n, \qquad (2)$$

which says that the total number of label flips is bounded by the parameter $n$.

Since the number of class labels ($q$) is often small (from 2 to 10), this LP
problem can be solved quickly. However, the LP problem must be solved $|T|$
times, where $|T|$ may be as large as 50,000. To avoid invoking the LP solver
unnecessarily, we propose two easy-to-check conditions. They are necessary
conditions in that, if either of them is violated, the set $S$ does not exist. Thus, we
invoke the LP solver only if both conditions are satisfied.


Necessary Conditions. The first condition is derived from Formula (1), by
adding up the two sides of the inequality constraint for all labels $l_i \neq y$. The
resulting condition is

$$\sum_{l_i \neq y} \#l_i - \sum_{i=1}^{q} \#flip_i < (q - 1)\,\#y + (q - 1) \left( \sum_{i=1}^{q} \#flip_i \right).$$

The second condition requires that, in $S$, label $y$ has a higher count (after
flipping) than any other label, including the label $l_p \neq y$ with the highest count in
the current $T_x^K$. The resulting condition is

$$(\#l_p - \#y)/2 < n,$$

since only when this condition is satisfied is it possible for $y$ to have a
higher count than $l_p$ by flipping at most $n$ elements of label $l_p$ to $y$.

These are necessary conditions (but may not be sufficient conditions) because,
whenever the first condition does not hold, Eq. (1) does not hold either. Similarly,
whenever the second condition does not hold, Eq. (1) does not hold either. In this
sense, these two conditions are easy-to-check over-approximations of Eq. (1).
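
To make the feasibility question concrete, the sketch below decides whether such a set $S$ exists for given neighbor labels. It is not the paper's LP-based implementation: instead of calling an LP solver, it swaps in a direct integer search over the total number of flips, which is a simplification chosen for this illustration. It does apply the second necessary condition as an early exit. The name `abs_must_err` follows the listing; the label-list interface is an assumption.

```python
from collections import Counter
from typing import Hashable, Sequence

def abs_must_err(neighbor_labels: Sequence[Hashable], n: int, y: Hashable) -> bool:
    """True if no choice of at most n flips towards y makes y the label
    with a strictly higher count than every other label."""
    counts = Counter(neighbor_labels)
    count_y = counts.pop(y, 0)
    others = list(counts.values())
    top = max(others, default=0)
    # Second necessary condition: (#l_p - #y) / 2 < n must hold.
    if (top - count_y) / 2 >= n:
        return True
    for f in range(n + 1):  # f = total number of flips to y
        if f > sum(others):
            break           # not enough non-y elements to flip
        # Each label l_i must give up enough elements to fall below #y + f.
        needed = sum(max(0, c - (count_y + f) + 1) for c in others)
        if needed <= f:
            return False    # a suitable set S exists
    return True
```

On the two examples above (label counts 4/4/2 and 5/3/2 with n = 2 and y being the least frequent label), the function returns False, in line with the sets $S$ constructed there.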


6 Experiments 


We have implemented our method as a software tool written in Python using 
the scikit-learn machine learning library. We evaluated our tool on six datasets 
that are widely used in the fairness research literature. 


Datasets. Table 1 shows the statistics of each dataset, including the name,
a short description, the size ($|T|$), the number of attributes, the protected
attributes, and the parameters $\epsilon$ and $n$. The value of $\epsilon$ is set to 1% of the attribute
range. The bias parameter $n$ is set to 1 for small datasets, 10 for medium datasets,
and 50 for large datasets. The protected attributes include Gender for all six
datasets, and Race for two datasets, Compas and Adult, which is consistent
with known biases in these datasets.

In preparation for the experimental evaluation, we have employed state-of-the-art
techniques in the machine learning literature to preprocess and balance
the datasets for KNN, including encoding, standard scaling, k-bins-discretizer,
downsampling and upweighting.

Table 1. Statistics of all of the datasets used during our experimental evaluation.

Dataset | Description               | Size $\lvert T \rvert$ | # Attr. | Protected Attr. | Parameters $\epsilon$ and $n$
Salary  | salary level [42]         | 52     | 4  | Gender      | $\epsilon$ = 1% attribute range, $n$ = 1
Student | academic performance [9]  | 649    | 30 | Gender      | $\epsilon$ = 1% attribute range, $n$ = 1
German  | credit risk [13]          | 1,000  | 20 | Gender      | $\epsilon$ = 1% attribute range, $n$ = 10
Compas  | recidivism risk [11]      | 10,500 | 16 | Race+Gender | $\epsilon$ = 1% attribute range, $n$ = 10
Default | loan default risk [47]    | 30,000 | 36 | Gender      | $\epsilon$ = 1% attribute range, $n$ = 50
Adult   | earning power [13]        | 48,842 | 14 | Race+Gender | $\epsilon$ = 1% attribute range, $n$ = 50


Table 2. Results for certifying label-flipping and individual fairness (gender) on small
datasets, for which ground truth can still be obtained by naive enumeration and
compared with our method.

        | Certifying label-flipping fairness                              | Certifying label-flipping + individual fairness
Name    | Ground truth | Time  | Our method | Time | Accuracy | Speedup | Ground truth | Time  | Our method | Time | Accuracy | Speedup
Salary  | 50.0%        | 1.7s  | 33.3%      | 0.2s | 66.7%    | 8.5X    | 33.3%        | 1.5s  | 33.3%      | 0.2s | 100%     | 7.5X
Student | 70.8%        | 23.0s | 60.0%      | 0.2s | 84.7%    | 115X    | 58.5%        | 25.2s | 44.6%      | 0.2s | 76.2%    | 116X


Methods. For comparison purposes, we implemented six variants of our
method, by enabling or disabling the ability to certify label-flipping fairness,
the ability to certify individual fairness, and the ability to certify $\epsilon$-fairness.

Except for $\epsilon$-fairness, we also implemented the naive approach of enumerating
all $T' \in \mathrm{dBias}_n(T)$. Since the naive approach does not rely on approximation,
its result can be regarded as the ground truth (i.e., whether the classification
output for an input $x$ is truly fair). Our goal is to obtain the ground truth on
small datasets, and use it to evaluate the accuracy of our abstract interpretation
based method. However, as explained before, enumeration does not work for
$\epsilon$-fairness, since the number of inputs in $\Delta^\epsilon(x)$ is infinite.

Our experiments were conducted on a computer with a 2 GHz Quad-Core Intel
Core i5 CPU and 16 GB of memory. The experiments were designed to answer
two questions. First, is our method efficient and accurate enough in handling
popular datasets in the fairness literature? Second, does our method help us
gain insights? For example, it would be interesting to know whether decisions
made on individuals from a protected minority group are more (or less) likely
to be certified as fair.


Results on Efficiency and Accuracy. We first evaluate the efficiency and 
accuracy of our method. For the two small datasets, Salary and Student, we are 
able to obtain the ground truth using the naive enumeration approach, and then 
compare it with the result of our abstract interpretation based method. We want 
to know how much our results deviate from the ground truth. 

Table 2 shows the results obtained by treating Gender as the protected
attribute. Column 1 shows the name of the dataset. Columns 2-7 compare the
naive approach (ground truth) and our method in certifying label-flipping
fairness. Columns 8-13 compare the naive approach (ground truth) and our method
in certifying label-flipping plus individual fairness.

Table 3. Results for certifying label-flipping, individual, and $\epsilon$-fairness by our method.

Name             | Label-flipping fairness | Time | + Individual fairness | Time | + $\epsilon$-fairness | Time
Salary (gender)  | 33.3%                   | 0.2s | 33.3%                 | 0.2s | 33.3%        | 0.2s
Student (gender) | 60.0%                   | 0.2s | 44.6%                 | 0.2s | 32.3%        | 0.2s
German (gender)  | 48.0%                   | 0.2s | 44.0%                 | 0.3s | 43.0%        | 0.2s
Compas (race)    | 95.0%                   | 0.3s | 62.4%                 | 1.4s | 56.4%        | 1.1s
Compas (gender)  | 95.0%                   | 0.3s | 65.3%                 | 1.3s | 59.4%        | 1.0s
Default (gender) | 83.2%                   | 2.3s | 73.3%                 | 4.4s | 64.4%        | 3.5s
Adult (race)     | 76.2%                   | 2.2s | 65.3%                 | 4.5s | 53.5%        | 5.3s
Adult (gender)   | 76.2%                   | 2.2s | 52.5%                 | 3.5s | 43.6%        | 3.3s


Based on the results in Table 2, we conclude that the accuracy of our method
is high (81.9% on average) despite its aggressive use of abstraction to reduce
the computational cost. Our method is also 7.5X to 126X faster than the naive
approach. Furthermore, the larger the dataset, the higher the speedup.

For medium and large datasets, it is infeasible for the naive enumeration
approach to compute and show the ground truth in Table 2. However, the fairness
scores of our method shown in Table 3 provide “lower bounds” for the ground
truth since our method is sound for certification. For example, when our method
reports 95% for Compas (race) in Table 3, it means the ground truth must be
$\ge$95% (and thus the gap must be $\le$5%). However, there does not seem to be
an obvious relationship between the gap and the dataset size; the gap may be due
to some unique characteristics of each dataset.


Results on the Certification Rates. We now present the success rates of 
our certification method for the three variants of fairness. Table 3 shows the 
results for label-flipping fairness in Columns 2-3, label-flipping plus individual 
fairness (denoted + Individual fairness) in Columns 4-5, and label-flipping plus 
ε-fairness (denoted + ε-fairness) in Columns 6-7. For each variant of fairness, 
we show the percentage of test inputs that are certified to be fair, together with 
the average certification time (per test input). In all six datasets, Gender was 
treated as the protected attribute. In addition, Race was treated as the protected 
attribute for Compas and Adult. 

From the results in Table 3, we see that as a more stringent fairness standard 
is used, the certified percentage either stays the same (as in Salary) or decreases 
(as in Student). This is consistent with what we expect, since the classification 
output is required to stay the same for an increasingly larger number of scenarios. 
For Compas (race), in particular, adding ε-fairness on top of label-flipping 
fairness causes the certified percentage to drop from 62.4% to 56.4%. 

Nevertheless, our method still maintains a high certification percentage. 
Recall that, for Salary, the 33.3% certification rate (for +Individual fairness) 
is actually 100% accurate according to comparison with the ground truth in 
Table 2, while the 44.6% certification rate for Student (for +Individual fairness) is actually 
76.2% accurate. Furthermore, the efficiency of our method is high: for Adult, 
which has 50,000 samples in the training set, the average certification time of 
our method remains within a few seconds. 


Table 4. Results for certifying label-flipping + ε-fairness with both Race and Gender 
as protected attributes. 


(a) Compas 
        | White | Other | Wt. Avg 
Male    | 61.9% | 52.2% | 52.8% 
Female  | 100%  | 60.0% | 63.7% 
Wt. Avg | 63.7% | 53.7% | 54.4% 

(b) Adult 
        | White | Other | Wt. Avg 
Male    | 35.3% | 33.3% | 35.1% 
Female  | 33.3% | 66.7% | 37.0% 
Wt. Avg | 34.7% | 44.4% | 35.6% 


Results on Demographic Groups. Table 4 shows the certified percentage of 
each demographic group, when both label-flipping and ε-fairness are considered, 
and both Race and Gender are treated as protected attributes. The four demo- 
graphic groups are (1) White Male, (2) White Female, (3) Other Male, and (4) 
Other Female. For each group, we show the certified percentage obtained by our 
method. In addition, we show the weighted averages for White and Other, as 
well as the weighted averages for Male and Female. 

For Compas, White Female has the highest certified percentage (100%) while 
Other Female has the lowest certified percentage (52.2%); here, the classification 
output represents the recidivism risk. 

For Adult, Other Female has the highest certified percentage (66.7%) while 
the other three groups have certified percentages in the range of 33.3%-35.3%. 

The differences may be attributed to two sources, one of which is technical 
and the other is social. The social reason is related to historical bias, which is 
well documented for these datasets. If the actual percentages (ground truth) are 
different, the percentages reported by our method will also be different. The 
technical reason is related to the nature of the KNN algorithm itself, which we 
explain as follows. 

In these datasets, some demographic groups have significantly more samples 
than others. In KNN, the lowest occurring group may have a limited number 
of close neighbors. Thus, for each test input x from this group, its K nearest 
neighbors tend to have a larger radius in the input vector space. As a result, 
the impact of ε perturbation on x will be smaller, resulting in fewer changes to 
its K nearest neighbors. That may be one of the reasons why, in Table 4, the 
lowest occurring groups, White Female in Compas and Other Female in Adult, 
have significantly higher certified percentages than other groups. 

Results in Table 4 show that, even if a machine learning technique discrim- 
inates against certain demographic groups, for an individual, the prediction 
result produced by the machine learning technique may still be fair. This is 
closely related to differences (and sometimes conflicts) between group fairness 
and individual fairness: while group fairness focuses on statistical parity, individ- 
ual fairness focuses on similar outcomes for similar individuals. Both are useful 
notions and in many cases they are complementary. 
Caveat. Our work should not be construed as an endorsement or a criticism of 
the use of machine learning techniques in socially sensitive applications. Instead, 
it should be viewed as an effort on developing new methods and tools to help 
improve our understanding of these techniques. 


7 Related Work 


For fairness certification, as explained earlier in this paper, our method is the 
first method for certifying KNN in the presence of historical (dataset) bias. 
While there are other KNN certification and falsification techniques, including 
Jia et al. [20] and Li et al. [24,25], they focus solely on robustness against data 
poisoning attacks as opposed to individual and ε-fairness against historical bias. 
Meyer et al. [26,27] and Drews et al. [12] propose certification techniques that 
handle dataset bias, but target different machine learning techniques (decision 
tree or linear regression); furthermore, they do not handle ε-fairness. 

Throughout this paper, we have assumed that the KNN learning (parameter- 
tuning) step is not tampered with or subjected to fairness violation. However, 
since the only impact of tampering with the KNN learning step will be changing 
the optimal value of the parameter K, the biased KNN learning step can be 
modeled using a properly over-approximated K Set. With this new K Set, our 
method for certifying fairness of the prediction result (as presented in Sect. 4) 
will work as is. 

Our method aims to certify fairness with certainty. In contrast, there are 
statistical techniques that can be used to prove that a system is fair or robust 
with a high probability. Such techniques have been applied to various machine 
learning models, for example, in VeriFair [6] and FairSquare [2]. However, they 
are typically applied to the prediction step while ignoring the learning step, 
although the learning step may be affected by dataset bias. 

There are also techniques for mitigating bias in machine learning systems. 
Some focus on improving the learning algorithms using random smoothing [33], 
better embedding [7] or fair representation [34], while others rely on formal 
methods such as iterative constraint solving [38]. There are also techniques for 
repairing models to improve fairness [3]. Except for Ruoss et al. [34], most of 
them focus on group fairness such as demographic parity and equal opportunity; 
they are significantly different from our focus on certifying individual and 
ε-fairness of the classification results in the presence of dataset bias. 

At a high level, our method that leverages a sound over-approximate analysis 
to certify fairness can be viewed as an instance of the abstract interpretation 
paradigm [10]. Abstract interpretation based techniques have been successfully 
used in many other settings, including verification of deep neural networks [17, 
30], concurrent software [21,22,37], and cryptographic software [43,44]. 

Since fairness is a type of non-functional property, the verifica- 
tion/certification techniques are often significantly different from techniques used 
to verify/certify functional correctness. Instead, they are more closely related to 
techniques for verifying/certifying robustness [8], noninterference [5], and side- 
channel security [19,39,40, 48], where a program is executed multiple times, each 
time for a different input drawn from a large (and sometimes infinite) set, to see 
if they all agree on the output. At a high level, this is closely related to differen- 
tial verification [28,31,32], synthesis of relational invariants [41] and verification 
of hyper-properties [15,35]. 


8 Conclusions 


We have presented a method for certifying the individual and ε-fairness of the 
classification output of the KNN algorithm, under the assumption that the train- 
ing dataset may have historical bias. Our method relies on abstract interpreta- 
tion to soundly approximate the arithmetic computations in the learning and 
prediction steps. Our experimental evaluation shows that the method is efficient 
in handling popular datasets from the fairness research literature and accurate 
enough in obtaining certifications for a large amount of test data. While this 
paper focuses on KNN only, as future work, we plan to extend our method to 
other machine learning models. 
Abstract. Machine-learned systems are in widespread use for making 
decisions about humans, and it is important that they are fair, i.e., not 
biased against individuals based on sensitive attributes. We present run- 
time verification of algorithmic fairness for systems whose models are 
unknown, but are assumed to have a Markov chain structure. We intro- 
duce a specification language that can model many common algorithmic 
fairness properties, such as demographic parity, equal opportunity, and 
social burden. We build monitors that observe a long sequence of events 
as generated by a given system, and output, after each observation, a 
quantitative estimate of how fair or biased the system was on that run 
until that point in time. The estimate is proven to be correct modulo a 
variable error bound and a given confidence level, where the error bound 
gets tighter as the observed sequence gets longer. Our monitors are of two 
types, and use, respectively, frequentist and Bayesian statistical inference 
techniques. While the frequentist monitors compute estimates that are 
objectively correct with respect to the ground truth, the Bayesian mon- 
itors compute estimates that are correct subject to a given prior belief 
about the system’s model. Using a prototype implementation, we show 
how we can monitor if a bank is fair in giving loans to applicants from 
different social backgrounds, and if a college is fair in admitting stu- 
dents while maintaining a reasonable financial burden on the society. 
Although they exhibit different theoretical complexities in certain cases, 
in our experiments, both frequentist and Bayesian monitors took less 
than a millisecond to update their verdicts after each observation. 


1 Introduction 


Runtime verification complements traditional static verification techniques, by 
offering lightweight solutions for checking properties based on a single, possibly 
long execution trace of a given system [8]. We present new runtime verification 
techniques for the problem of bias detection in decision-making software. The use 
of software for making critical decisions about humans is a growing trend; exam- 
ple areas include judiciary [13, 20], policing [23,49], banking [48], etc. It is impor- 
tant that these software systems are unbiased towards the protected attributes 


This work is supported by the European Research Council under Grant No.: 
ERC-2020-AdG101020093. 
© The Author(s) 2023 


C. Enea and A. Lal (Eds.): CAV 2023, LNCS 13965, pp. 358-382, 2023. 
https://doi.org/10.1007/978-3-031-37703-7_17 


Monitoring Algorithmic Fairness 359 


of humans, like gender, ethnicity, etc. However, they have often shown biases in 
their decisions in the past [20,47,55,57,58]. While there are many approaches 
for mitigating biases before deployment [20,47,55,57,58], recent runtime verifi- 
cation approaches [3,34] offer a new complementary tool to oversee algorithmic 
fairness in AI and machine-learned decision makers during deployment. 

To verify algorithmic fairness at runtime, the given decision-maker is treated 
as a generator of events with an unknown model. The goal is to algorithmically 
design lightweight but rigorous runtime monitors against quantitative formal 
specifications. The monitors observe a long stream of events and, after each 
observation, output a quantitative, statistically sound estimate of how fair or 
biased the generator was until that point in time. While the existing approaches 
[3,34] considered only sequential decision making models and built monitors 
from the frequentist viewpoint in statistics, we allow the richer class of Markov 
chain models and present monitors from both the frequentist and the Bayesian 
statistical viewpoints. 

Monitoring algorithmic fairness involves on-the-fly statistical estimations, a 
feature that has not been well-explored in the traditional runtime verification 
literature. As far as the algorithmic fairness literature is concerned, the existing 
works are mostly model-based, and either minimize decision biases of machine- 
learned systems at design-time (i.e., pre-processing) [11,41,65,66], or verify their 
absence at inspection-time (i.e., post-processing) [32]. In contrast, we verify algo- 
rithmic fairness at runtime, and do not require an explicit model of the gener- 
ator. On one hand, the model-independence makes the monitors trustworthy, 
and on the other hand, it complements the existing model-based static analyses 
and design techniques, which are often insufficient due to partially unknown or 
imprecise models of systems in real-world environments. 

We assume that the sequences of events generated by the generator can 
be modeled as sequences of states visited by a finite unknown Markov chain. 
This implies that the generator is well-behaved and the events follow each other 
according to some fixed probability distributions. Not only is this assumption 
satisfied by many machine-learned systems (see Sect. 1.1 for examples), it also 
provides just enough structure to lay the bare-bones foundations for runtime 
verification of algorithmic fairness properties. We emphasize that we do not 
require knowledge of the transition probabilities of the underlying Markov chain. 

We propose a new specification language, called the Probabilistic Specifica- 
tion Expressions (PSEs), which can formalize a majority of the existing algo- 
rithmic fairness properties in the literature, including demographic parity [21], 
equal opportunity [32], disparate impact [25], etc. Let Q be the set of events. 
Syntactically, a PSE is a restricted arithmetic expression over the (unknown) 
transition probabilities of a Markov chain with the state space Q. Semantically, 
a PSE φ over Q is a function that maps every Markov chain M with the state 
space Q to a real number, and the value φ(M) represents the degree of fairness 
or bias (with respect to φ) in the generator M. Our monitors observe a long 
sequence of events from Q, and after each observation, compute a statistically 
rigorous estimate of φ(M) with a PAC-style error bound for a given confidence 
level. As the observed sequence gets longer, the error bound gets tighter. 
Algorithmic fairness properties that are expressible using PSEs are quantitative 
refinements of the traditional qualitative fairness properties studied in 
formal methods. For example, a qualitative fairness property may require that 
if a certain event A occurs infinitely often, then another event B should follow 
infinitely often. In particular, a coin is qualitatively fair if infinitely many coin 
tosses contain both infinitely many heads and infinitely many tails. In contrast, 
the coin will be algorithmically fair (i.e., unbiased) if approximately half of the 
tosses come up heads. Technically, while qualitative weak and strong fairness 
properties are ω-regular, the algorithmic fairness properties are statistical and 
require counting. Moreover, for a qualitative fairness property, the satisfaction or 
violation cannot be established based on a finite prefix of the observed sequence. 
In contrast, for any given finite prefix of observations, the value of an algorith- 
mic fairness property can be estimated using statistical techniques, assuming the 
future behaves statistically like the past (the Markov assumption). 

As our main contribution, we present two different monitoring algorithms, 
using tools from frequentist and Bayesian statistics, respectively. The central 
idea of the frequentist monitor is that the probability of every transition of the 
monitored Markov chain M can be estimated using the fraction of times the 
transition is taken per visit to its source vertex. Building on this, we present a 
practical implementation of the frequentist monitor that can estimate the value 
of a given PSE from an observed finite sequence of states. For the coin example, 
after every new toss, the frequentist monitor will update its estimate of proba- 
bility of seeing heads by computing the fraction of times the coin came up heads 
so far, and then by using concentration bounds to find a tight error bound for 
a given confidence level. On the other hand, the central idea of the Bayesian 
monitor is that we begin with a prior belief about the transition probabilities of 
M, and having seen a finite sequence of observations, we can obtain an updated 
posterior belief about M. For a given confidence level, the output of the monitor 
is computed by applying concentration inequalities to find a tight error bound 
around the mean of the posterior belief. For the coin example, the Bayesian 
monitor will begin with a prior belief about the degree of fairness, and, after 
observing the outcome of each new toss, will compute a new posterior belief. 
If the prior belief agrees with the true model with a high probability, then the 
Bayesian monitor’s output converges to the true value of the PSE more quickly 
than the frequentist monitor. In general, both monitors can efficiently estimate 
more complicated PSEs, such as the ratio and the squared difference of the 
probabilities of heads of two different coins. The choice of the monitor for a par- 
ticular application depends on whether an objective or a subjective evaluation, 
with respect to a given prior, is desired. 
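
As a rough illustration of the coin example above, the following Python sketch tracks the fraction of heads with a Hoeffding-style error bound (frequentist) and a Beta belief with a Chebyshev-style error bound (Bayesian). The class names, the `run_coin_monitors` helper, the Beta prior, and the specific choice of Hoeffding and Chebyshev inequalities are assumptions made for illustration; they are not necessarily the bounds used by the monitors in this paper.

```python
import math


class FrequentistCoinMonitor:
    """Tracks the fraction of heads and a Hoeffding error bound (illustrative)."""

    def __init__(self, delta):
        self.delta = delta  # allowed failure probability, i.e. 1 - confidence
        self.heads = 0
        self.tosses = 0

    def observe(self, is_heads):
        self.tosses += 1
        if is_heads:
            self.heads += 1
        estimate = self.heads / self.tosses
        # Hoeffding: P(|estimate - p| >= eps) <= 2 * exp(-2 * n * eps^2).
        eps = math.sqrt(math.log(2.0 / self.delta) / (2.0 * self.tosses))
        return estimate, eps


class BayesianCoinMonitor:
    """Beta(a, b) belief over P(heads), with a Chebyshev bound (illustrative)."""

    def __init__(self, delta, a=1.0, b=1.0):
        self.delta = delta
        self.a = a  # prior pseudo-count of heads (assumed Beta prior)
        self.b = b  # prior pseudo-count of tails

    def observe(self, is_heads):
        if is_heads:
            self.a += 1.0
        else:
            self.b += 1.0
        total = self.a + self.b
        mean = self.a / total
        variance = self.a * self.b / (total * total * (total + 1.0))
        # Chebyshev on the posterior: P(|p - mean| >= eps) <= variance / eps^2.
        eps = math.sqrt(variance / self.delta)
        return mean, eps


def run_coin_monitors(tosses, delta=0.05):
    """Feeds the same toss sequence to both monitors; returns all outputs."""
    freq = FrequentistCoinMonitor(delta)
    bayes = BayesianCoinMonitor(delta)
    outputs = []
    for toss in tosses:
        outputs.append((freq.observe(toss), bayes.observe(toss)))
    return outputs
```

Each call to `observe` returns a pair (estimate, error bound); in both sketches the bound shrinks as more tosses are seen, mirroring the behaviour described in the text.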

Both frequentist and Bayesian monitors use registers (and counters as a 
restricted class of registers) to keep counts of the relevant events and store the 
intermediate results. If the size of the given PSE is n, then, in theory, the 
frequentist monitor uses $O(n^4 2^{2n})$ registers and computes its output in $O(n^4 2^{2n})$ 
time after each new observation, whereas the Bayesian monitor uses $O(n^2 2^n)$ 
registers and computes its output in $O(n^2 2^n)$ time after each new observation. 
The computation time and the required number of registers get drastically 
reduced to $O(n^2)$ for the frequentist monitor with PSEs that contain up to one 
division operator, and for the Bayesian monitor with polynomial PSEs (possibly 
having negative exponents in the monomials). This shows that under given cir- 
cumstances, one or the other type of the monitor can be favorable computation- 
wise. These special, efficient cases cover many algorithmic fairness properties of 
interest, such as demographic parity and equal opportunity. 

Our experiments confirm that our monitors are fast in practice. Using a 
prototype implementation in Rust, we monitored a couple of decision-making 
systems adapted from the literature. In particular, we monitor if a bank is fair 
in lending money to applicants from different demographic groups [48], and if 
a college is fair in admitting students without creating an unreasonable finan- 
cial burden on the society [54]. In our experiments, both monitors took, on an 
average, less than a millisecond to update their verdicts after each observation, 
and only used tens of internal registers to operate, thereby demonstrating their 
practical usability at runtime. 

In short, we advocate that runtime verification introduces a new set of tools in 
the area of algorithmic fairness, using which we can monitor biases of deployed AI 
and machine-learned systems in real-time. While existing monitoring approaches 
only support sequential decision making problems and use only the frequentist 
statistical viewpoint, we present monitors for the more general class of Markov 
chain system models using both frequentist and Bayesian statistical viewpoints. 

All proofs can be found in the longer version of the paper [33]. 


1.1 Motivating Examples 


We first present two real-world examples from the algorithmic fairness literature 
to motivate the problem; these examples will later be used to illustrate the 
technical developments. 


The Lending Problem [48]: Suppose a bank lends money to individuals based 
on certain attributes, like credit score, age group, etc. The bank wants to 
maximize profit by lending money to only those who will repay the loan in time, 
called the “true individuals.” There is a sensitive attribute (e.g., ethnicity) 
classifying the population into two groups g and ḡ. The bank will be considered fair 
(in lending money) if its lending policy is independent of an individual’s 
membership in g or ḡ. Several group fairness metrics from the literature are relevant 
in this context. Disparate impact [25] quantifies the ratio of the probability of 
an individual from g getting the loan to the probability of an individual from ḡ 
getting the loan, which should be close to 1 for the bank to be considered fair. 
Demographic parity [21] quantifies the difference between the probability of an 
individual from g getting the loan and the probability of an individual from ḡ 
getting the loan, which should be close to 0 for the bank to be considered fair. 
Equal opportunity [32] quantifies the difference between the probability of a true 
individual from g getting the loan and the probability of a true individual from 
ḡ getting the loan, which should be close to 0 for the bank to be considered fair. 
A discussion on the relative merits of the various algorithmic fairness notions 
is out of scope of this paper, but can be found in the literature [15,22,43,62]. 
We show how we can monitor whether a given group fairness criterion is fulfilled 
by the bank, by observing a sequence of lending decisions. 
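
To make the three metrics concrete, here is a small Python sketch that computes their empirical counterparts from a list of lending decisions. The function names, the group labels `"g"` and `"g_bar"`, and the record layout are hypothetical. The sketch also assumes, as a simplification, that the "true individual" flag is known for every applicant, which need not hold in the Markov chain model used later, where repayment is observed only after a loan is granted.

```python
def rate(num, den):
    """Empirical probability num/den, or None if the condition never occurred."""
    return num / den if den > 0 else None


def lending_metrics(decisions):
    """decisions: iterable of (group, granted, is_true), group in {"g", "g_bar"}."""
    stats = {grp: {"n": 0, "granted": 0, "true": 0, "true_granted": 0}
             for grp in ("g", "g_bar")}
    for group, granted, is_true in decisions:
        s = stats[group]
        s["n"] += 1
        s["granted"] += int(granted)
        if is_true:
            s["true"] += 1
            s["true_granted"] += int(granted)

    def diff(a, b):
        return None if a is None or b is None else a - b

    def ratio(a, b):
        # Undefined if either rate is missing or the denominator is zero.
        return None if a is None or not b else a / b

    p_g = rate(stats["g"]["granted"], stats["g"]["n"])
    p_gbar = rate(stats["g_bar"]["granted"], stats["g_bar"]["n"])
    t_g = rate(stats["g"]["true_granted"], stats["g"]["true"])
    t_gbar = rate(stats["g_bar"]["true_granted"], stats["g_bar"]["true"])
    return {
        "disparate_impact": ratio(p_g, p_gbar),      # fair when close to 1
        "demographic_parity": diff(p_g, p_gbar),     # fair when close to 0
        "equal_opportunity": diff(t_g, t_gbar),      # fair when close to 0
    }
```

These are plain point estimates; the monitors described in this paper additionally attach an error bound for a given confidence level.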


The College Admission Problem [54]: Consider a college that announces a 
cutoff of grades for admitting students through an entrance examination. Based 
on the merit, every truly qualified student belongs to group g, and the rest 
to group ḡ. Knowing the cutoff, every student can choose to invest a sum of 
money (proportional to the gap between the cutoff and their true merit) to be 
able to reach the cutoff, e.g., by taking private tuition classes. On the other hand, 
the college’s utility is in minimizing admission of students from ḡ, which can be 
accomplished by raising the cutoff to a level that is too expensive to be achieved 
by the students from ḡ and yet easy to be achieved by the students from g. The 
social burden associated to the college’s cutoff choice is the expected expense of 
every student from g, which should be close to 0 for the college to be considered 
fair (towards the society). We show how we can monitor the social burden, by 
observing a sequence of investment decisions made by the students from g. 
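
The social burden can be illustrated by a running-average sketch in Python. The class name `SocialBurdenMonitor`, the parameter `max_cost` (an assumed upper bound on any single investment), and the use of a Hoeffding bound for bounded variables are assumptions for illustration only, not the monitor construction of this paper.

```python
import math


class SocialBurdenMonitor:
    """Running mean of investments by group-g students (illustrative sketch)."""

    def __init__(self, max_cost, delta):
        self.max_cost = max_cost  # assumed upper bound on a single investment
        self.delta = delta        # allowed failure probability
        self.count = 0
        self.total = 0.0

    def observe(self, group, cost=None):
        # Only investments made by students from group g contribute.
        if group == "g":
            self.count += 1
            self.total += cost
        if self.count == 0:
            return None  # no estimate before the first group-g observation
        mean = self.total / self.count
        # Hoeffding for values in [0, max_cost]:
        # P(|mean - mu| >= eps) <= 2 * exp(-2 * n * eps^2 / max_cost^2).
        eps = self.max_cost * math.sqrt(
            math.log(2.0 / self.delta) / (2.0 * self.count))
        return mean, eps
```

The returned mean estimates the expected expense of a student from g, which the text requires to be close to 0 for the college to be considered fair.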


1.2 Related Work 


There has been a plethora of work on algorithmic fairness from the machine 
learning standpoint [10,12,21,32,38,42,45,46,52,59,63,66]. In general, these 
works improve algorithmic fairness through de-biasing the training dataset (pre- 
processing), or through incentivizing the learning algorithm to make fair deci- 
sions (in-processing), or through eliminating biases from the output of the 
machine-learned model (post-processing). All of these are interventions in the 
design of the system, whereas our monitors treat the system as already deployed. 

Recently, formal methods-inspired techniques have been used to guarantee 
algorithmic fairness through the verification of a learned model [2,9,29, 53,61], 
and enforcement of robustness [6,30,39]. All of these works verify or enforce 
algorithmic fairness statically on all runs of the system with high probability. 
This requires certain knowledge about the system model, which may not be 
always available. Our runtime monitor dynamically verifies whether the current 
run of an opaque system is fair. 

Our frequentist monitor is closely related to the novel work of Albarghouthi 
et al. [3], where the authors build a programming framework that allows run- 
time monitoring of algorithmic fairness properties on programs. Their monitor 
evaluates the algorithmic fairness of repeated “single-shot” decisions made by 
machine-learned functions on a sequence of samples drawn from an underly- 
ing unknown but fixed distribution, which is a special case of our more general 
Markov chain model of the generator. They do not consider the Bayesian point 
of view. Moreover, we argue and empirically show in Sect. 4 that our frequentist 
approach produces significantly tighter statistical estimates than their approach 
on most PSEs. On the flip side, their specification language is more expressive, 
in that they allow atomic variables for expected values of events, which is useful 
for specifying individual fairness criteria [21]. We only consider group fairness, 
and leave individual fairness as part of future research. Also, they allow logical 
operators (like boolean connectives) in their specification language. However, we 
obtain tighter statistical estimates for the core arithmetic part of algorithmic 
fairness properties (through PSEs), and point out that we can deal with logical 
operators just like they do in a straightforward manner. 

Shortly after the first manuscript of this paper was written, we published 
a separate work for monitoring long-run fairness in sequential decision making 
problems, where the feature distribution of the population may dynamically 
change due to the actions of the individuals [34]. Although this other work 
generalizes our current paper in some aspects (support for dynamic changes in 
the model), it only allows sequential decision making models (instead of Markov 
chains) and does not consider the Bayesian monitoring perspective. 

There is a large body of research on monitoring, though the considered prop- 
erties are mainly temporal [5,7,19, 24,40, 50,60]. Unfortunately, these techniques 
do not directly extend to monitoring algorithmic fairness, since checking algo- 
rithmic fairness requires statistical methods, which is beyond the limit of finite 
automata-based monitors used by the classical techniques. Although there are 
works on quantitative monitoring that use richer types of monitors (with coun- 
ters/registers like us) [28,35,36,56], the considered specifications do not easily 
extend to statistical properties like algorithmic fairness. One exception is the 
work by Ferrère et al. [26], which monitors certain statistical properties, like 
mode and median of a given sequence of events. Firstly, they do not consider 
algorithmic fairness properties. Secondly, their monitors’ outputs are correct only 
as the length of the observed sequence approaches infinity (asymptotic guaran- 
tee), whereas our monitors’ outputs are always correct with high confidence 
(finite-sample guarantee), and the precision gets better for longer sequences. 

Although our work uses similar tools as used in statistical verification [1, 
4,14,17,64], the goals are different. In traditional statistical verification, the 
system’s runs are chosen probabilistically, and it is verified if any run of the 
system satisfies a boolean property with a certain probability. For us, the run 
is given as input to the monitor, and it is this run that is verified against a 
quantitative algorithmic fairness property with statistical error bounds. To the 
best of our knowledge, existing works on statistical verification do not consider 
algorithmic fairness properties. 


2 Preliminaries 


For any alphabet $\Sigma$, the notation $\Sigma^*$ represents the set of all finite words over 
$\Sigma$. We write $\mathbb{R}$, $\mathbb{N}$, and $\mathbb{N}^+$ to denote the sets of real numbers, natural numbers 
(including zero), and positive integers, respectively. For a pair of real (natural) 
numbers $a, b$ with $a < b$, we write $[a, b]$ ($[a..b]$) to denote the set of all real 
(natural) numbers between and including $a$ and $b$. For a given $c, r \in \mathbb{R}$, we write 
$[c \pm r]$ to denote the set $[c - r, c + r]$. For simpler notation, we will use $|\cdot|$ to 
denote both the cardinality of a set and the absolute value of a real number, 
whenever the intended use is clear. 


364 T. A. Henzinger et al. 


For a given vector $v \in \mathbb{R}^n$ and a given $m \times n$ real matrix $M$, for some $m, n$, 
we write $v_i$ to denote the $i$-th element of $v$ and write $M_{ij}$ to denote the element 
at the $i$-th row and the $j$-th column of $M$. For a given $n \in \mathbb{N}^+$, a simplex 
is the set of vectors $\Delta(n) := \{x \in [0,1]^{n+1} \mid \sum_i x_i = 1\}$. Notice that the 
dimension of $\Delta(n)$ is $n + 1$ (and not $n$), a convention that is standard due to 
the interpretation of $\Delta(n)$ as the $n + 1$ vertices of an $n$-dimensional polytope. 
A stochastic matrix of dimension $m \times m$ is a matrix whose every row is in 
$\Delta(m-1)$, i.e., $M \in \Delta(m-1)^m$. Random variables will be denoted using uppercase 
symbols from the Latin alphabet (e.g., $X$), while the associated outcomes will 
be denoted using lowercase font of the same symbol ($x$ is an outcome of $X$). We 
will interchangeably use the expected value $\mathbb{E}(X)$ and the mean $\mu_X$ of $X$. For a 
given set $S$, define $\mathcal{D}(S)$ as the set of every random variable, called a probability 
distribution¹, with set of outcomes being $2^S$. A Bernoulli random variable that 
produces “1” (the alternative is “0”) with probability $p$ is written as $\mathit{Bernoulli}(p)$. 


2.1 Markov Chains as Randomized Generators of Events 


We use finite Markov chains as sequential randomized generators of events. A 
(finite) Markov chain $\mathcal{M}$ is a triple $(Q, M, \pi)$, where $Q = [1..N]$ is a set of 
states for a finite $N$, $M \in \Delta(N-1)^N$ is a stochastic matrix called the transition 
probability matrix, and $\pi \in \mathcal{D}(Q)$ is the distribution over initial states. We often 
refer to a pair of states $(i, j) \in Q \times Q$ as an edge. The Markov chain $\mathcal{M}$ generates 
an infinite sequence of random variables $X_0 = \pi, X_1, \ldots$, with $X_i \in \mathcal{D}(Q)$ for 
every $i$, such that the Markov property is satisfied: 
$\mathbb{P}(X_{n+1} = i_{n+1} \mid X_0 = i_0, \ldots, X_n = i_n) = \mathbb{P}(X_{n+1} = i_{n+1} \mid X_n = i_n)$, which is $M_{i_n i_{n+1}}$ in our case. 
A finite path $\vec{x} = x_0, \ldots, x_n$ of $\mathcal{M}$ is a finite word over $Q$ such that for every 
$t \in [0..n]$, $\mathbb{P}(X_t = x_t) > 0$. Let $\mathit{Paths}(\mathcal{M})$ be the set of every finite path of $\mathcal{M}$. 

We use Markov chains to model the probabilistic interaction between a 
machine-learned decision maker with its environment. Intuitively, the Markov 
assumption on the model puts the restriction that the decision maker does not 
change over time, e.g., due to retraining. 
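
The definitions above can be sketched in Python as follows: a check that a matrix is stochastic, a sampler that generates a finite path from a transition matrix and an initial distribution, and the per-source transition frequencies that a frequentist estimate would start from. The function names are hypothetical, states are indexed from 0 rather than 1 for convenience, and the code is only an illustration of the model, not the monitors developed in this paper.

```python
import random


def is_stochastic(M, tol=1e-9):
    """True if every row of M is a probability vector (lies in the simplex)."""
    return all(
        all(0.0 <= p <= 1.0 for p in row) and abs(sum(row) - 1.0) <= tol
        for row in M
    )


def sample_path(M, pi, length, rng):
    """Samples a finite path x_0, ..., x_{length-1}; pi is the initial distribution."""
    states = list(range(len(M)))
    path = [rng.choices(states, weights=pi)[0]]
    for _ in range(length - 1):
        # The next state depends only on the current one (Markov property).
        path.append(rng.choices(states, weights=M[path[-1]])[0])
    return path


def transition_frequencies(path, n_states):
    """Fraction of times each edge (i, j) is taken per visit to source i."""
    counts = [[0] * n_states for _ in range(n_states)]
    for src, dst in zip(path, path[1:]):
        counts[src][dst] += 1
    estimates = []
    for row in counts:
        visits = sum(row)
        # None marks sources that were never left on the observed path.
        estimates.append([c / visits if visits else None for c in row])
    return estimates
```

For a long path, the entries returned by `transition_frequencies` approach the corresponding entries of the (unknown) transition probability matrix for every state that is visited often enough.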

In Fig. 1 we show the Markov chains for the lending and the college admission 
examples from Sect. 1.1. The Markov chain for the lending example captures 
the sequence of loan-related probabilistic events, namely, that a loan applicant is 
randomly sampled and the group information (g or ḡ) is revealed, a probabilistic 
decision is made by the decision-maker and either the loan was granted (gy or 
ḡy, depending on the group) or refused (ȳ), and if the loan is granted then with 
some probabilities it either gets repaid (z) or defaulted (z̄). The Markov chain 
for the college admission example captures the sequence of admission events, 
namely, that a candidate is randomly sampled and the group is revealed (g, ḡ), 
and when the candidate is from group g (truly qualified) then the amount of 
money invested for admission is also revealed. 


1 An alternate commonly used definition of probability distribution is directly in terms
of the probability measure induced over S, instead of through the random variable.

Fig. 1. Markov chains for the lending and the college-admission examples. (left) The
lending example: The state init denotes the initiation of the sampling, and the rest
represent the selected individual, namely, $g$ and $\bar{g}$ denote the two groups, $(gy)$ and $(\bar{g}y)$
denote that the individual is respectively from group $g$ and group $\bar{g}$ and the loan was
granted, $\bar{y}$ denotes that the loan was refused, and $z$ and $\bar{z}$ denote whether the loan
was repaid or not. (right) The college admission example: The state init denotes the
initiation of the sampling, the states $g, \bar{g}$ represent the group identity of the selected
candidate, and the states $\{0, \ldots, N\}$ represent the amount of money invested by a truly
eligible candidate.


2.2 Randomized Register Monitors 


Randomized register monitors, or simply monitors, are adapted from the (deterministic)
polynomial monitors of Ferrère et al. [27]. Let $R$ be a finite set of integer
variables called registers. A function $v\colon R \to \mathbb{N}$ assigning a concrete value to every
register in $R$ is called a valuation of $R$. Let $\mathbb{N}^R$ denote the set of all valuations
of $R$. Registers can be read and written according to relations in the signature
$S = \langle 0, 1, +, -, \times, \div, \le \rangle$. We consider two basic operations on registers:


— A test is a conjunction of atomic formulas over S and their negation; 
— An update is a mapping from variables to terms over S. 


We use $\Phi(R)$ and $\Gamma(R)$ to respectively denote the set of tests and updates over
$R$. Counters are special registers with a restricted signature $S = \langle 0, 1, +, -, \le \rangle$.


Definition 1 (Randomized register monitor). A randomized register monitor
is a tuple $(\Sigma, \Lambda, R, \lambda, T)$ where $\Sigma$ is a finite input alphabet, $\Lambda$ is an output
alphabet, $R$ is a finite set of registers, $\lambda\colon \mathbb{N}^R \to \Lambda$ is an output function, and
$T\colon \Sigma \times \Phi(R) \to \mathcal{D}(\Gamma(R))$ is the randomized transition function such that for
every $\sigma \in \Sigma$ and for every valuation $v \in \mathbb{N}^R$, there exists a unique $\phi \in \Phi(R)$
with $v \models \phi$ and $T(\sigma, \phi) \in \mathcal{D}(\Gamma(R))$. A deterministic register monitor is a
randomized register monitor for which $T(\sigma, \phi)$ is a Dirac delta distribution, if it is
defined.


A state of a monitor $\mathcal{A}$ is a valuation of its registers $v \in \mathbb{N}^R$. The monitor
$\mathcal{A}$ transitions from state $v$ to a distribution over states given by the random
variable $Y = T(\sigma, \phi)$ on input $\sigma \in \Sigma$ if there exists $\phi$ such that $v \models \phi$. Let $\gamma$
be an outcome of $Y$ with $P(Y = \gamma) > 0$, in which case the registers are updated
as $v'(x) = v(\gamma(x))$ for every $x \in R$, and the respective concrete transition is
written as $v \xrightarrow{\sigma} v'$. A run of $\mathcal{A}$ on a word $w_0 \ldots w_n \in \Sigma^*$ is a sequence of concrete
transitions $v_0 \xrightarrow{w_0} v_1 \xrightarrow{w_1} \ldots \xrightarrow{w_n} v_{n+1}$. The probabilistic transitions of $\mathcal{A}$ induce
a probability distribution over the sample space of finite runs of the monitor,
denoted $P(\cdot)$. For a given finite word $w \in \Sigma^*$, the semantics of the monitor $\mathcal{A}$
is given by a random variable $[\![\mathcal{A}]\!](w) := \lambda(Y)$ inducing the probability measure
$P_{\mathcal{A}}$, where $Y$ is the random variable representing the distribution over the final
state in a run of $\mathcal{A}$ on the word $w$, i.e.,
$P_{\mathcal{A}}(Y = v) := P(\{r = r_0 \ldots r_m \mid r \text{ is a run of } \mathcal{A} \text{ on } w \text{ and } r_m = v\})$.


Example: A Monitor for Detecting the (Unknown) Bias of a Coin. We
present a simple deterministic monitor that computes a PAC estimate of the bias
of an unknown coin from a sequence of toss outcomes, where the outcomes are
denoted as “h” for heads and “t” for tails. The input alphabet is the set of toss
outcomes, i.e., $\Sigma = \{h, t\}$, the output alphabet is the set of all bias intervals,
i.e., $\Lambda = \{[a, b] \mid 0 \le a < b \le 1\}$, the set of registers is $R = \{r_n, r_h\}$, where
$r_n$ and $r_h$ are counters counting the total number of tosses and the number of
heads, respectively, and the output function $\lambda$ maps every valuation of $r_n, r_h$
to an interval estimate of the bias that has the form $\lambda \equiv v(r_h)/v(r_n) \pm \varepsilon(r_n, \delta)$,
where $\delta \in [0, 1]$ is a given upper bound on the probability of an incorrect estimate
and $\varepsilon(r_n, \delta)$ is the estimation error computed using PAC analysis. For instance,
after observing a sequence of 67 tosses with 36 heads, the values of the registers
will be $v(r_n) = 67$ and $v(r_h) = 36$, and the output of the monitor will be
$\lambda(67, 36) = 36/67 \pm \varepsilon(67, \delta)$ for some appropriate $\varepsilon(\cdot)$. Now, suppose the next
input to the monitor is h, in which case the monitor’s transition is given as
$T(h, \cdot) = (r_n + 1, r_h + 1)$, which updates the registers to the new values
$v'(r_n) = 67 + 1 = 68$ and $v'(r_h) = 36 + 1 = 37$. For this example, the tests $\Phi(R)$ over the
registers are redundant, but they can be used to construct monitors for more
complex properties.
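The coin monitor can be sketched in a few lines of Python. This is only an illustration: the class name `CoinBiasMonitor` is hypothetical, and the error term is assumed to come from Hoeffding's inequality for [0, 1]-valued samples, which is one possible choice of the PAC bound that the text leaves unspecified.

```python
import math

class CoinBiasMonitor:
    """Deterministic register monitor with counters r_n (tosses) and r_h (heads)."""

    def __init__(self, delta):
        self.delta = delta  # upper bound on the probability of a wrong estimate
        self.r_n = 0
        self.r_h = 0

    def next(self, symbol):
        # T(h, .) = (r_n + 1, r_h + 1) and T(t, .) = (r_n + 1, r_h).
        self.r_n += 1
        if symbol == "h":
            self.r_h += 1
        return self.output()

    def output(self):
        # lambda = v(r_h) / v(r_n) +/- eps(r_n, delta), clipped to [0, 1].
        if self.r_n == 0:
            return (0.0, 1.0)
        mean = self.r_h / self.r_n
        # Assumption: eps is the Hoeffding radius for samples in [0, 1].
        eps = math.sqrt(math.log(2.0 / self.delta) / (2.0 * self.r_n))
        return (max(0.0, mean - eps), min(1.0, mean + eps))

monitor = CoinBiasMonitor(delta=0.05)
for s in "h" * 36 + "t" * 31:   # 67 tosses with 36 heads
    monitor.next(s)
low, high = monitor.next("h")   # the 68th toss is a head
```

After the last call the registers hold 68 and 37, matching the update described in the example.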


3 Algorithmic Fairness Specifications and Problem Formulation


3.1 Probabilistic Specification Expressions 


To formalize algorithmic fairness properties, like the ones in Sect. 1.1, we
introduce probabilistic specification expressions (PSE). A PSE $\varphi$ over a given finite
set $Q$ is an algebraic expression with some restricted set of operations that uses
variables labeled $v_{ij}$ with $i, j \in Q$ and whose domains are the real interval $[0, 1]$.
The syntax of $\varphi$ is:


$\xi ::= v \in \{v_{ij}\}_{i,j \in Q} \mid \xi \cdot \xi \mid 1 \div \xi$, (1a)
$\varphi ::= \kappa \in \mathbb{R} \mid \xi \mid \varphi + \varphi \mid \varphi - \varphi \mid \varphi \cdot \varphi \mid (\varphi)$, (1b)


where $\{v_{ij}\}_{i,j \in Q}$ are the variables with domain $[0, 1]$ and $\kappa$ is a constant. The
expression $\xi$ in (1a) is called a monomial and is simply a product of powers of
variables with integer exponents. A polynomial is a weighted sum of monomials
with constant weights.² Syntactically, polynomials form a strict subclass of the
expressions definable using (1b), because the product of two polynomials is not a
polynomial, but is a valid expression according to (1b). A PSE is division-free
if there is no division operator involved in $\varphi$. The size of an expression $\varphi$ is the
total number of arithmetic operators (i.e., $+, -, \cdot, \div$) in $\varphi$. We use $V_\varphi$ to denote
the set of variables appearing in the expression $\varphi$, and for every $V \subseteq V_\varphi$ we
define $\mathit{Dom}(V) := \{i \in Q \mid \exists v_{ij} \in V \lor \exists v_{ki} \in V\}$ as the set containing any state
of the Markov chain that is involved in some variable in $V$.

The semantics of a PSE $\varphi$ is interpreted statically on the unknown Markov
chain $\mathcal{M}$: we write $\varphi(M)$ to denote the evaluation or the value of $\varphi$ by substituting
every variable $v_{ij}$ in $\varphi$ with $M_{ij}$. E.g., for a Markov chain with state space
$\{1, 2\}$ and transition probabilities $M_{11} = 0.2$, $M_{12} = 0.8$, $M_{21} = 0.4$, and $M_{22} = 0.6$,
the expression $\varphi = v_{11} - v_{21}$ has the evaluation $\varphi(M) = 0.2 - 0.4 = -0.2$.
We will assume that for every expression $(1 \div \xi)$, $\xi(M) \ne 0$.
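As a small illustration of this static semantics, the following sketch evaluates a PSE on a known transition matrix. The tuple encoding of expressions and the function name `evaluate` are assumptions made for this example and are not part of the paper.

```python
# Expression nodes are nested tuples (a hypothetical encoding):
# ("var", i, j), ("const", k), ("inv", e), ("add", e1, e2),
# ("sub", e1, e2), ("mul", e1, e2).
def evaluate(expr, M):
    """Return phi(M): substitute every v_ij by M[i][j] and evaluate."""
    tag = expr[0]
    if tag == "var":
        _, i, j = expr
        return M[i][j]
    if tag == "const":
        return expr[1]
    if tag == "inv":
        value = evaluate(expr[1], M)
        if value == 0:
            # The text assumes xi(M) != 0 for every subexpression 1 / xi.
            raise ZeroDivisionError("xi(M) must be nonzero under 1 / xi")
        return 1.0 / value
    left, right = evaluate(expr[1], M), evaluate(expr[2], M)
    if tag == "add":
        return left + right
    if tag == "sub":
        return left - right
    if tag == "mul":
        return left * right
    raise ValueError(f"unknown node: {tag}")

# The two-state chain from the text and the PSE phi = v_11 - v_21.
M = {1: {1: 0.2, 2: 0.8}, 2: {1: 0.4, 2: 0.6}}
phi = ("sub", ("var", 1, 1), ("var", 2, 1))
value = evaluate(phi, M)
```

Here `value` reproduces the evaluation 0.2 - 0.4 from the example, up to floating-point rounding.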


Example: Group Fairness. Using PSEs, we can express the group fairness 
properties for the lending example described in Sect. 1.1, with the help of the 
Markov chain in the left subfigure of Fig. 1: 


Disparate impact [25]: $v_{g(gy)} \div v_{\bar{g}(\bar{g}y)}$
Demographic parity [21]: $v_{g(gy)} - v_{\bar{g}(\bar{g}y)}$


The equal opportunity criterion requires the following probability to be close
to zero: $p = P(y \mid g, z) - P(y \mid \bar{g}, z)$, which is tricky to monitor as $p$ contains
the counter-factual probabilities representing “the probability that an individual
from a group would repay had the loan been granted.” We apply Bayes’ rule,
and turn $p$ into the following equivalent form:


$p' = \dfrac{P(z \mid g, y) \cdot P(y \mid g)}{P(z \mid g)} - \dfrac{P(z \mid \bar{g}, y) \cdot P(y \mid \bar{g})}{P(z \mid \bar{g})}.$


Assuming $P(z \mid g) = c_1$ and $P(z \mid \bar{g}) = c_2$, where $c_1$ and $c_2$ are known constants,
the property $p'$ can be encoded as a PSE as below:


Equal opportunity [32]: $(v_{(gy)z} \cdot v_{g(gy)}) \div c_1 - (v_{(\bar{g}y)z} \cdot v_{\bar{g}(\bar{g}y)}) \div c_2$


Example: Social Burden. Using PSEs, we can express the social burden of 
the college admission example described in Sect. 1.1, with the help of the Markov 
chain depicted in the right subfigure of Fig. 1: 


Social burden [54]: $1 \cdot v_{g1} + \ldots + N \cdot v_{gN}$.


3.2 The Monitoring Problem 


Informally, our goal is to build monitors that observe a single long path of a
Markov chain and, after each observation, output a new estimate for the value
of the PSE. Since the monitor’s estimate is based on statistics collected from
a finite path, the output may be incorrect with some probability, where the
source of this probability is different between the frequentist and the Bayesian
approaches. In the frequentist approach, the underlying Markov chain is fixed
(but unknown), and the randomness stems from the sampling of the observed
path. In the Bayesian approach, the observed path is fixed, and the randomness
stems from the uncertainty about a prior specifying the Markov chain’s parameters.
The commonality is that, in both cases, we want our monitors to estimate
the value of the PSE up to an error with a fixed probabilistic confidence.


2 Although monomials and polynomials usually only have positive exponents, we take
the liberty to use the terminologies even when negative exponents are present.

We formalize the monitoring problem separately for the two approaches. A
problem instance is a triple $(Q, \varphi, \delta)$, where $Q = [1..N]$ is a set of states, $\varphi$ is
a PSE over $Q$, and $\delta \in [0, 1]$ is a constant. In the frequentist approach, we use
$P_s$ to denote the probability measure induced by sampling of paths, and in the
Bayesian approach we use $P_\theta$ to denote the probability measure induced by the
prior probability density function $p_\theta\colon \Delta(N-1)^N \to \mathbb{R} \cup \{\infty\}$ over the transition
matrix of the Markov chain. In both cases, the output alphabets of the monitors
contain every real interval.


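Problem 1 (Frequentist monitor). Suppose $(Q, \varphi, \delta)$ is a problem instance
given as input. Design a monitor $\mathcal{A}$ such that for every Markov chain $\mathcal{M}$ with
transition probability matrix $M$ and for every finite path $\vec{x} \in \mathit{Paths}(\mathcal{M})$:


$P_{s,\mathcal{A}}\left(\varphi(M) \in [\![\mathcal{A}]\!](\vec{x})\right) \ge 1 - \delta$, (2)

where $P_{s,\mathcal{A}}$ is the joint probability measure of $P_s$ and $P_{\mathcal{A}}$.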


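Problem 2 (Bayesian monitor). Suppose $(Q, \varphi, \delta)$ is a problem instance and
$p_\theta$ is a prior density function, both given as inputs. Design a monitor $\mathcal{A}$ such
that for every Markov chain $\mathcal{M}$ with transition probability matrix $M$ and for
every finite path $\vec{x} \in \mathit{Paths}(\mathcal{M})$:


$P_{\theta,\mathcal{A}}\left(\varphi(M) \in [\![\mathcal{A}]\!](\vec{x}) \mid \vec{x}\right) \ge 1 - \delta$, (3)

where $P_{\theta,\mathcal{A}}$ is the joint probability measure of $P_\theta$ and $P_{\mathcal{A}}$.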


Notice that the state space of the Markov chain and the input alphabet of the
monitor are the same, and so, many times, we refer to observed states as (input)
symbols, and vice versa. The estimate $[l, u] = [\![\mathcal{A}]\!](\vec{x})$ is called the $(1 - \delta) \cdot 100\%$
confidence interval for $\varphi(M)$.³ The radius, given by $\varepsilon = 0.5 \cdot (u - l)$, is called the
estimation error, and the quantity $1 - \delta$ is called the confidence. The estimate
gets more precise as the error gets smaller and the confidence gets higher.

In many situations, we are interested in a qualitative question of the form
“is $\varphi(M) \le c$?” for some constant $c$. We point out that, once the quantitative
problem is solved, the qualitative questions can be answered using standard
procedures by setting up a hypothesis test [44, p. 380].


3 While in the Bayesian setting credible intervals would be more appropriate, we
use confidence intervals due to uniformity and the relative ease of computation. To
relate the two, our confidence intervals are over-approximations of credible intervals
(non-unique) that are centered around the posterior mean.


4 Frequentist Monitoring


Suppose the given PSE is only a single variable $\varphi = v_{ij}$, i.e., we are monitoring
the probability of going from state $i$ to another state $j$. The frequentist monitor
$\mathcal{A}$ for $\varphi$ can be constructed in two steps: (1) empirically compute the average
number of times the edge $(i, j)$ was taken per visit to the state $i$ on the observed
path of the Markov chain, and (2) compute the $(1 - \delta) \cdot 100\%$ confidence interval
using statistical concentration inequalities.

Now consider a slightly more complex PSE $\varphi' = v_{ij} + v_{ik}$. One approach to
monitor $\varphi'$, proposed by Albarghouthi et al. [3], would be to first compute the
$(1 - \delta) \cdot 100\%$ confidence intervals $[l_1, u_1]$ and $[l_2, u_2]$ separately for the two
constituent variables $v_{ij}$ and $v_{ik}$, respectively. Then, the $(1 - 2\delta) \cdot 100\%$ confidence
interval for $\varphi'$ would be given by the sum of the two intervals $[l_1, u_1]$ and $[l_2, u_2]$,
i.e., $[l_1 + l_2, u_1 + u_2]$; notice the drop in overall confidence due to the union bound.
The drop in the confidence level and the additional error introduced by the interval
arithmetic accumulate quickly for larger PSEs, making the estimate unusable.
Furthermore, we lose all the advantages of having any dependence between the terms
in the PSE. For instance, by observing that $v_{ij}$ and $v_{ik}$ correspond to
the mutually exclusive transitions $i$ to $j$ and $i$ to $k$, we know that $\varphi'(M)$
never exceeds 1, a feature that will be lost if we use plain merging of individual
confidence intervals for $v_{ij}$ and $v_{ik}$. We overcome these issues by estimating
the value of the PSE as a whole as much as possible. In Fig. 2, we demonstrate
how the ratio between the estimation errors from the two approaches varies as
the number of summands (i.e., $n$) in the PSE $\varphi = \sum_{i=1}^{n} v_{1i}$ changes; in both
cases we fixed the overall $\delta$ to 0.05 (95% confidence). The ratio remains the same
for different observation lengths. Our approach is always at least as accurate as
their approach [3], and is significantly better for larger PSEs.


[Plot: relative estimation error (y-axis) against no. of summands (x-axis, 0 to 50).]

Fig. 2. Variation of the ratio of the est. error using the existing approach [3] to the
est. error using our approach, w.r.t. the size of the chosen PSE.
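The interval-merging baseline discussed above can be illustrated with a short sketch. It is not the implementation of [3]; the function name `sum_intervals_union_bound` and the numbers are assumptions chosen to show both effects mentioned in the text: the confidence drops with every summand, and the merged interval may exceed 1 even for mutually exclusive transitions.

```python
def sum_intervals_union_bound(intervals, delta):
    """Add k intervals, each of which is valid with probability >= 1 - delta."""
    low = sum(l for l, _ in intervals)
    high = sum(u for _, u in intervals)
    # Union bound: the probability that some interval is wrong is at most k * delta.
    confidence = max(0.0, 1.0 - len(intervals) * delta)
    return (low, high), confidence

# Hypothetical 95% intervals for v_ij and v_ik.
(low, high), confidence = sum_intervals_union_bound(
    [(0.25, 0.5), (0.5, 0.75)], delta=0.05
)
```

With these inputs the merged interval is [0.75, 1.25] at confidence 0.9, so its upper end lies above the value 1 that $v_{ij} + v_{ik}$ can never exceed.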


4.1 The Main Principle 


We first explain the idea for division-free PSEs, i.e., PSEs that do not involve 
any division operator; later we extend our approach to the general case. 


Division-Free PSEs: In our algorithm, for every variable $v_{ij} \in V_\varphi$, we introduce
a $\mathit{Bernoulli}(M_{ij})$ random variable $Y^{ij}$ with the mean $M_{ij}$ unknown to us. We
make an observation $y_p^{ij}$ for every $p$-th visit to the state $i$ on a run, and if $j$ follows
immediately afterwards then record $y_p^{ij} = 1$ else record $y_p^{ij} = 0$. This gives us
a sequence of observations $\vec{y}^{\,ij} = y_1^{ij}, y_2^{ij}, \ldots$ corresponding to the sequence of
i.i.d. random variables $\vec{Y}^{ij} = Y_1^{ij}, Y_2^{ij}, \ldots$. For instance, for the run 121123 we
obtain $\vec{y}^{\,12} = 1, 0, 1$ for the variable $v_{12}$.

The heart of our algorithm is an aggregation procedure of every sequence of
random variables $\{\vec{Y}^{ij}\}_{v_{ij} \in V_\varphi}$ to a single i.i.d. sequence $\vec{W}$ of an auxiliary random
variable $W$, such that the mean of $W$ is $\mu_W = E(W) = \varphi(M)$. We can then use
known concentration inequalities on the sequence $\vec{W}$ to estimate $\mu_W$. Since $\mu_W$
exactly equals $\varphi(M)$ by design, we obtain a tight concentration bound on $\varphi(M)$.
We informally explain the main idea of constructing $\vec{W}$ using simple examples;
the details can be found in Algorithm 2.


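Sum and Difference: Let $\varphi = v_{ij} + v_{kl}$. We simply combine $\vec{Y}^{ij}$ and $\vec{Y}^{kl}$
as $W_p = Y_p^{ij} + Y_p^{kl}$, so that $w_p = y_p^{ij} + y_p^{kl}$ is the corresponding observation
of $W_p$. Then $\mu_{W_p} = \varphi(M)$ holds, because $\mu_{W_p} = E(W_p) = E(Y_p^{ij} + Y_p^{kl}) =
E(Y_p^{ij}) + E(Y_p^{kl}) = M_{ij} + M_{kl}$. A similar approach works for $\varphi = v_{ij} - v_{kl}$.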
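The observation sequences and the aggregation for sums can be sketched as follows. The function names are hypothetical, and the sketch stores whole sequences for clarity, whereas the monitor itself works with bounded memory.

```python
def bernoulli_observations(run, i, j):
    """Return y^{ij}: one 0/1 outcome per visit to state i that has a successor."""
    return [1 if nxt == j else 0
            for cur, nxt in zip(run, run[1:]) if cur == i]

def aggregate_sum(y_ij, y_kl):
    """w_p = y_p^{ij} + y_p^{kl} for every index p available in both sequences."""
    return [a + b for a, b in zip(y_ij, y_kl)]

run = [1, 2, 1, 1, 2, 3]                  # the run 121123 from the text
y12 = bernoulli_observations(run, 1, 2)   # outcomes for v_12
y21 = bernoulli_observations(run, 2, 1)   # outcomes for v_21
w = aggregate_sum(y12, y21)               # outcomes of W for phi = v_12 + v_21
```

On this run, `y12` is the sequence 1, 0, 1 given in the text; only two outcomes of W are available because state 2 has been left only twice.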


Multiplication: For multiplications, the same linearity principle will not always
work, since for random variables $A$ and $B$, $E(A \cdot B) = E(A) \cdot E(B)$ only if $A$ and
$B$ are statistically independent, which will not be true for specifications of the
form $\varphi = v_{ij} \cdot v_{ik}$. In this case, the respective Bernoulli random variables $Y_p^{ij}$ and
$Y_p^{ik}$ are dependent: $P(Y_p^{ij} = 1) \cdot P(Y_p^{ik} = 1) = M_{ij} \cdot M_{ik}$, but $P(Y_p^{ij} = 1 \land Y_p^{ik} = 1)$
is always 0 (since both $j$ and $k$ cannot be visited following the $p$-th visit to $i$).
To benefit from independence once again, we temporally shift one of the
random variables by defining $W_p = Y_{2p}^{ij} \cdot Y_{2p+1}^{ik}$, with $w_p = y_{2p}^{ij} \cdot y_{2p+1}^{ik}$. Since the
random variables $Y_{2p}^{ij}$ and $Y_{2p+1}^{ik}$ are independent, as they use separate visits of
state $i$, we obtain $\mu_{W_p} = M_{ij} \cdot M_{ik}$. For independent multiplications of
the form $\varphi = v_{ij} \cdot v_{kl}$ with $i \ne k$, we can simply use $W_p = Y_p^{ij} \cdot Y_p^{kl}$.
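The temporal shift can be illustrated on a fixed list of successors of state i. The helper name `dependent_product` and the example data are assumptions made for this sketch.

```python
def dependent_product(y_ij, y_ik):
    """w_p = y^{ij}_{2p} * y^{ik}_{2p+1}, with visits to state i counted from 1."""
    n = min(len(y_ij), len(y_ik))
    # Visit q (1-indexed) is list element q - 1, so visits 2p and 2p + 1
    # are the list elements 2p - 1 and 2p.
    return [y_ij[2 * p - 1] * y_ik[2 * p] for p in range(1, (n - 1) // 2 + 1)]

# Hypothetical successors observed after seven visits to state i.
successors = ["j", "k", "j", "k", "k", "j", "k"]
y_ij = [1 if s == "j" else 0 for s in successors]
y_ik = [1 if s == "k" else 0 for s in successors]

same_visit = [a * b for a, b in zip(y_ij, y_ik)]   # products within one visit
shifted = dependent_product(y_ij, y_ik)            # products across two visits
```

The same-visit products are all 0, as argued in the text, whereas the shifted products can be 1 because the two factors stem from different visits to i.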

In general, we use the ideas of aggregation and temporal shift on the syntax
tree of the PSE $\varphi$, inductively. With an aggregated sequence of observations
for the auxiliary variable $W$ for $\varphi$, we can find an estimate for $\varphi(M)$ using
Hoeffding’s inequality. We present the detailed algorithm of this monitor, namely
FreqMonitorDivFree, in Algorithm 1.
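The final estimation step can be sketched as below, assuming the standard two-sided Hoeffding bound for i.i.d. samples with values in $[l_\varphi, u_\varphi]$. The function name and the sample values are hypothetical.

```python
import math

def hoeffding_interval(ws, lo, hi, delta):
    """(1 - delta) confidence interval for E(W) from i.i.d. samples ws in [lo, hi]."""
    n = len(ws)
    mean = sum(ws) / n
    # Solve 2 * exp(-2 n eps^2 / (hi - lo)^2) = delta for eps.
    eps = math.sqrt((hi - lo) ** 2 * math.log(2.0 / delta) / (2.0 * n))
    return mean - eps, mean + eps

# Hypothetical outcomes of W for phi = v_ij + v_kl, whose range is [0, 2].
ws = [2, 0, 1, 1, 2, 1, 0, 1]
low, high = hoeffding_interval(ws, lo=0.0, hi=2.0, delta=0.05)
```

The interval is centered at the empirical mean of the aggregated outcomes, and its radius shrinks proportionally to one over the square root of the number of outcomes.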


The General Case (PSEs With Division Operators): We observe that
every arbitrary PSE $\varphi$ of size $n$ can be transformed into a semantically equivalent
PSE of the form $\varphi_a + \frac{\varphi_b}{\varphi_c}$ of size $O(n^2 2^n)$, where $\varphi_a$, $\varphi_b$, and $\varphi_c$ are all
division-free. Once in this form, we can employ three different FreqMonitorDivFree
monitors from Algorithm 1 to obtain separate interval estimates for $\varphi_a$, $\varphi_b$, and
$\varphi_c$, which are then combined using standard interval arithmetic and the resulting
confidence of the estimate is obtained through the union bound. The steps for
constructing the (general-case) FrequentistMonitor are shown in Algorithm 2,
and the detailed analysis can be found in the proof of Theorem 1.
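The combination step can be sketched with elementary interval arithmetic. The helper names are hypothetical, and the sketch assumes that the interval for $\varphi_c$ does not contain 0 and that each of the three input intervals holds with probability at least $1 - \delta/3$.

```python
def add(a, b):
    """Interval sum [a] + [b]."""
    return (a[0] + b[0], a[1] + b[1])

def divide(b, c):
    """Interval quotient [b] / [c]; defined only if 0 is not in [c]."""
    if c[0] <= 0.0 <= c[1]:
        raise ValueError("denominator interval contains 0")
    quotients = [b[0] / c[0], b[0] / c[1], b[1] / c[0], b[1] / c[1]]
    return (min(quotients), max(quotients))

def combine(a, b, c, delta):
    """Interval for phi_a + phi_b / phi_c with confidence 1 - delta (union bound)."""
    return add(a, divide(b, c)), 1.0 - delta

# Hypothetical interval estimates returned by the three division-free monitors.
interval, confidence = combine((1.0, 2.0), (1.0, 2.0), (2.0, 4.0), delta=0.05)
```

For these inputs the quotient interval is [0.25, 1.0], so the combined estimate is [1.25, 3.0].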


Bounding Memory: Consider a PSE $\varphi = v_{ij} + v_{kl}$. The outcome $w_p$ for $\varphi$ can
only be computed when both the Bernoulli outcomes $y_p^{ij}$ and $y_p^{kl}$ are available. If
at any point only one of the two is available, then we need to store the available
one so that it can be used later when the other one becomes available. It can be
shown that the storage of “unmatched” outcomes may need unbounded memory.

To bound the memory, we use the insight that a random reshuffling of the
i.i.d. sequence $\vec{y}^{\,ij}$ would still be i.i.d. with the same distribution, so that we do
not need to store the exact order in which the outcomes appeared. Instead, for
every $v_{ij} \in V_\varphi$, we only store the number of times we have seen the state $i$ and
the edge $(i, j)$ in counters $c_i$ and $c_{ij}$, respectively. Observe that $c_i \ge \sum_{v_{ik} \in V_\varphi} c_{ik}$,
where the possible difference accounts for the visits to irrelevant states, denoted
as a dummy state $\top$. Given $\{c_{ik}\}_k$, whenever needed, we generate in $x_i$ a random
reshuffling of the sequence of states, together with $\top$, seen after the past visits
to $i$. From the sequence stored in $x_i$, for every $v_{ik} \in V_\varphi$, we can consistently
determine the value of $y_p^{ik}$ (consistency dictates $y_p^{ij} = 1 \Rightarrow y_p^{ik} = 0$). Moreover,
we reuse space by resetting $x_i$ whenever the sequence stored in $x_i$ is no longer
needed. It can be shown that the size of every $x_i$ can be at most the size of
the expression [33, Proof of Thm. 2]. This random reshuffling of the observation
sequences is the cause of the probabilistic transitions of the frequentist monitor.
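The reshuffling idea can be sketched as sampling without replacement from the stored counters. The function name `extract_outcomes`, the string "T" standing for the dummy state, and the use of a seeded `random.Random` are assumptions of this illustration.

```python
import random

TOP = "T"  # dummy symbol for successors that are irrelevant to the PSE

def extract_outcomes(c_i, c_ik, t, rng):
    """Draw up to t successors of state i, without replacement, from the counters."""
    remaining = dict(c_ik)   # copy, so that the caller's counters stay intact
    left = c_i               # visits to i that have not been drawn yet
    x_i = []
    while len(x_i) < t and left > 0:
        symbols = list(remaining) + [TOP]
        weights = [remaining[k] for k in remaining] + [left - sum(remaining.values())]
        q = rng.choices(symbols, weights=weights)[0]
        left -= 1
        if q != TOP:
            remaining[q] -= 1
        x_i.append(q)
    return x_i

# State i was visited 5 times: twice followed by j, once by k, twice by other states.
x_i = extract_outcomes(c_i=5, c_ik={"j": 2, "k": 1}, t=5, rng=random.Random(0))
y_ij = [1 if q == "j" else 0 for q in x_i]
y_ik = [1 if q == "k" else 0 for q in x_i]
```

Whatever order is drawn, the sequence contains each successor exactly as often as it was counted, and the derived outcomes are consistent: no position has both `y_ij` and `y_ik` equal to 1.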


4.2 Implementation of the Frequentist Monitor 


Fix a problem instance $(Q, \varphi, \delta)$, with size of $\varphi$ being $n$. Let $\varphi$ be transformed
into $\varphi^l$ by relabeling duplicate occurrences of $v_{ij}$ using distinct labels $v_{ij}^1, v_{ij}^2, \ldots$.
The set of labeled variables in $\varphi^l$ is $V_\varphi^l$, and $|V_\varphi^l| = O(n)$. Let $\mathit{SubExpr}(\varphi)$ denote
the set of every subexpression in the expression $\varphi$, and use $[l_\varphi, u_\varphi]$ to denote the
range of values the expression $\varphi$ can take for every valuation of every variable
as per the domain $[0, 1]$. Let $\mathit{Dep}(\varphi) = \{i \mid \exists v_{ij} \in V_\varphi\}$, and every subexpression
$\varphi_1 \cdot \varphi_2$ with $\mathit{Dep}(\varphi_1) \cap \mathit{Dep}(\varphi_2) \ne \emptyset$ is called a dependent multiplication.

Implementation of FreqMonitorDivFree in Algorithm 1 has two main functions.
Init initializes the registers. Next implements the transition function of
the monitor, which attempts to compute a new observation $w$ for $W$ (Line 4)
after observing a new input $\sigma'$, and if successful it updates the output of the
monitor by invoking the UpdateEst function. In addition to the registers in Init
and Next labeled in the pseudocode, the following registers are used internally:


— $x_i$, $i \in \mathit{Dom}(V_\varphi)$: reshuffled sequence of states that followed $i$.
— $t_{ij}^l$: the index of $x_i$ that was used to obtain the latest outcome of $v_{ij}^l$.


Now, we summarize the main results for the frequentist monitor. 


Theorem 1 (Correctness). Let $(Q, \varphi, \delta)$ be a problem instance. Algorithm 2
implements a monitor for $(Q, \varphi, \delta)$ that solves Problem 1.


Theorem 2 (Computational resources). Let $(Q, \varphi, \delta)$ be a problem instance
and $\mathcal{A}$ be the monitor implemented using the FrequentistMonitor routine of
Algorithm 2. Suppose the size of $\varphi$ is $n$. The monitor $\mathcal{A}$ requires $O(n^4 2^{2n})$
registers, and takes $O(n^4 2^{2n})$ time to update its output after receiving a new input
symbol. For the special case of $\varphi$ containing at most one division operator (division
by constant does not count), $\mathcal{A}$ requires only $O(n^2)$ registers, and takes only
$O(n^2)$ time to update its output after receiving a new input symbol.


Algorithm 1. FreqMonitorDivFree

Parameters: Q, φ, δ
Output: Λ

 1: function Init(σ)
 2:   φ^l ← unique labeling of φ
 3:   for all v_ij ∈ V_φ do
 4:     c_ij ← 0
 5:     c_i ← 0
 6:   n ← 0
 7:   [illegible in source]
 8:   [register illegible in source] ← ⊥
 9:   [illegible in source]
10:   ResetX()
11:   Compute l_φ, u_φ

 1: function Next(σ')
 2:   c_σ ← c_σ + 1
 3:   c_σσ' ← c_σσ' + 1
 4:   w ← Eval(φ^l)
 5:   if w ≠ ⊥ then
 6:     n ← n + 1
 7:     Λ ← UpdateEst(w, n)
 8:     ResetX()
 9:   σ ← σ'
10:   return Λ

 1: function Eval(φ^l)
 2:   if r_φ^l = ⊥ then
 3:     if φ^l ≡ φ_1^l + φ_2^l then
 4:       r_φ^l ← Eval(φ_1^l) + Eval(φ_2^l)
 5:     else if φ^l ≡ φ_1^l − φ_2^l then
 6:       r_φ^l ← Eval(φ_1^l) − Eval(φ_2^l)
 7:     else if φ^l ≡ φ_1^l · φ_2^l then
 8:       if Dep(V_φ1^l) ∩ Dep(V_φ2^l) = ∅ then
 9:         r_φ^l ← Eval(φ_1^l) · Eval(φ_2^l)
10:       else
11:         for v_ij^l ∈ V_φ2^l with i ∈ Dep(V_φ1^l) do
12:           t_ij^l ← max({t_im^l | v_im^l ∈ V_φ1^l})
13:           t_ij^l ← t_ij^l + 1
14:         r_φ^l ← Eval(φ_1^l) · Eval(φ_2^l)
15:     else if φ^l ≡ v_ij^l then
16:       if x_i[t_ij^l + 1] = ⊥ then
17:         ExtractOutcome(x_i, t_ij^l + 1)
18:       if x_i[t_ij^l + 1] = j then
19:         r_φ^l ← 1
20:       else
21:         r_φ^l ← 0
22:     else if φ^l ≡ c then
23:       r_φ^l ← c
24:   return r_φ^l

 1: function UpdateEst(w, n)
 2:   μ_Λ ← (μ_Λ · (n − 1) + w) / n
 3:   ε_Λ ← sqrt((u_φ − l_φ)² · ln(2/δ) / (2n))
 4:   return [μ_Λ ± ε_Λ]

 1: function ExtractOutcome(x_i, t)
 2:   Let U ← {j ∈ Q | v_ij ∈ V_φ}
 3:   for p = |x_i| + 1, ..., t do
 4:     q ← pick u ∈ U w/ prob. c_iu / c_i, or pick ⊤ w/ prob. (c_i − Σ_{u∈U} c_iu) / c_i
 5:     c_i ← c_i − 1
 6:     if q ≠ ⊤ then
 7:       c_iq ← c_iq − 1
 8:     x_i[|x_i| + 1] ← q

 1: function ResetX()
 2:   for all i ∈ Dom(V_φ) do
 3:     x_i ← ∅
 4:   for all v_ij^l ∈ V_φ^l do
 5:     t_ij^l ← 0


Algorithm 2. FrequentistMonitor

Parameters: Q, φ, δ
Output: Λ

 1: function Init(σ)
 2:   φ_a + φ_b ÷ φ_c ← change the form of φ
 3:   A_a ← FreqMonitorDivFree(Q, φ_a, δ/3)
 4:   A_b ← FreqMonitorDivFree(Q, φ_b, δ/3)
 5:   A_c ← FreqMonitorDivFree(Q, φ_c, δ/3)
 6:   A_a.Init(σ)
 7:   A_b.Init(σ)
 8:   A_c.Init(σ)

 1: function Next(σ')
 2:   [μ_a ± ε_a] ← A_a.Next(σ')
 3:   [μ_b ± ε_b] ← A_b.Next(σ')
 4:   [μ_c ± ε_c] ← A_c.Next(σ')
 5:   if μ_a ≠ ⊥ ∧ μ_b ≠ ⊥ ∧ μ_c ≠ ⊥ then
 6:     [μ_Λ ± ε_Λ] ← [μ_a ± ε_a] + [μ_b ± ε_b] ÷ [μ_c ± ε_c]
 7:   return [μ_Λ ± ε_Λ]


There is a tradeoff between the estimation error, the confidence, and the 
length of the observed sequence of input symbols. For instance, for a fixed con- 
fidence, the longer the observed sequence is, the smaller is the estimation error. 
The following theorem establishes a lower bound on the length of the sequence 
for a given upper bound on the estimation error and a fixed confidence. 


Theorem 3 (Convergence speed). Let $(Q, \varphi, \delta)$ be a problem instance where
$\varphi$ does not contain any division operator, and let $\mathcal{A}$ be the monitor computed
using Algorithm 2. Suppose the size of $\varphi$ is $n$. For a given upper bound on
estimation error $\bar{\varepsilon} \in \mathbb{R}$, the minimum number of visits to every state in $\mathit{Dom}(V_\varphi)$
for obtaining an output with error at most $\bar{\varepsilon}$ and confidence at least $1 - \delta$ on any
path is given by:

$\dfrac{(u_\varphi - l_\varphi)^2 \ln\left(\frac{2}{\delta}\right)}{2\bar{\varepsilon}^2} + n$, (4)


where $[l_\varphi, u_\varphi]$ is the set of possible values of $\varphi$ for every valuation of every
variable (having domain $[0, 1]$) in $V_\varphi$.


The bound follows from Hoeffding’s inequality, together with the fact
that every dependent multiplication increments the required number of samples
by 1. A similar bound for the general case with division is left open.
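A direct way to read the bound (4) is to evaluate it for concrete values. The function name `required_visits` is hypothetical, and rounding the Hoeffding term up to the next integer is an assumption of this sketch, made because a number of visits must be an integer.

```python
import math

def required_visits(l_phi, u_phi, eps, delta, n):
    """Evaluate bound (4): Hoeffding term, rounded up, plus the size n of the PSE."""
    hoeffding_term = (u_phi - l_phi) ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2)
    return math.ceil(hoeffding_term) + n

# phi = v_ij: range [0, 1], size 0, error 0.1, confidence 95%.
single = required_visits(0.0, 1.0, eps=0.1, delta=0.05, n=0)
# phi = v_ij - v_kl: range [-1, 1], size 1, same error and confidence.
difference = required_visits(-1.0, 1.0, eps=0.1, delta=0.05, n=1)
```

Doubling the width of the range quadruples the Hoeffding term, which is why the second expression needs roughly four times as many visits as the first.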


5 Bayesian Monitoring 


Fix a problem instance $(Q = [1..N], \varphi, \delta)$. Let $\mathbb{M} = \Delta(N-1)^N$ be the shorthand
notation for the set of transition probability matrices of the Markov chains with
state space $Q$. Let $p_\theta\colon \mathbb{M} \to [0, 1]$ be the prior probability density function
over $\mathbb{M}$, which is assumed to be specified using the matrix beta distribution
(the definition can be found in standard textbooks on Bayesian statistics [37,
p. 280]). Let $\mathbb{1}$ be a matrix, with its size dependent on the context, whose every
element is 1. We make the following common assumption [31,37, p. 50]:


Assumption 1 (Prior). We are given a parameter matrix $\theta \ge \mathbb{1}$, and $p_\theta$ is
specified using the matrix beta distribution with parameter $\theta$. Moreover, the initial
state of the Markov chain is fixed.


When $\theta = \mathbb{1}$, then $p_\theta$ is the uniform density function over $\mathbb{M}$. After observing
a path $\vec{x}$, using Bayes’ rule we obtain the posterior density function $p_\theta(\cdot \mid \vec{x})$,
which is known to be efficiently computable due to the so-called conjugacy
property that holds due to Assumption 1. From the posterior density, we obtain
the expected posterior semantic value of $\varphi$ as:
$E_\theta(\varphi(M) \mid \vec{x}) = \int_{\mathbb{M}} \varphi(M) \cdot p_\theta(M \mid \vec{x})\, dM$.
The heart of our Bayesian monitor is an efficient incremental
computation of $E_\theta(\varphi(M) \mid \vec{x})$, free from numerical integration. Once we can
compute $E_\theta(\varphi(M) \mid \vec{x})$, we can also compute the posterior variance $S^2$ of $\varphi(M)$
using the known expression $S^2 = E_\theta(\varphi^2(M) \mid \vec{x}) - E_\theta(\varphi(M) \mid \vec{x})^2$, which enables
us to compute a confidence interval for $\varphi(M)$ using Chebyshev’s inequality.
In the following, we summarize our procedure for estimating $E_\theta(\varphi(M) \mid \vec{x})$.


5.1 The Main Principle 


The incremental computation of $E_\theta(\varphi(M) \mid \vec{x})$ is implemented in
BayesExpMonitor. We first transform the expression $\varphi$ into the polynomial form
$\varphi' = \sum_l \kappa_l \xi_l$, where $\{\kappa_l\}_l$ are the weights and $\{\xi_l\}_l$ are monomials. If the size of
$\varphi$ is $n$ then the size of $\varphi'$ is $O(n 2^{n/2})$. Then we can use linearity to compute the
overall expectation as the weighted sum of expectations of the individual monomials:
$E_\theta(\varphi(M) \mid \vec{x}) = E_\theta(\varphi'(M) \mid \vec{x}) = \sum_l \kappa_l E_\theta(\xi_l(M) \mid \vec{x})$. In the following,
we summarize the procedure for estimating $E_\theta(\xi(M) \mid \vec{x})$ for every monomial $\xi$.

Let $\xi$ be a monomial, and let $\vec{x}ab \in Q^*$ be a sequence of states. We use
$d_{ij}$ to store the exponent of the variable $v_{ij}$ in the monomial $\xi$, and define
$d_a := \sum_{j \in [1..N]} d_{aj}$. Also, we record the sets of $(i, j)$-s and $i$-s with positive
and negative $d_{ij}$ and $d_i$ entries: $D_i^+ := \{j \mid d_{ij} > 0\}$, $D_i^- := \{j \mid d_{ij} < 0\}$,
$D^+ := \{i \mid d_i > 0\}$, and $D^- := \{i \mid d_i < 0\}$.

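For any given word $w \in Q^*$, let $c_{ij}(w)$ denote the number of $ij$-s in $w$ and
let $c_i(w) := \sum_{j \in Q} c_{ij}(w)$. Define $\bar{c}_i(w) := c_i(w) + \sum_{j \in Q} \theta_{ij}$ and
$\bar{c}_{ij}(w) := c_{ij}(w) + \theta_{ij}$. Let $H\colon Q^* \to \mathbb{R}$ be defined as:


$H(w) := \dfrac{\prod_{i=1}^{N} \prod_{j \in D_i^+} {}^{(\bar{c}_{ij}(w)-1)+|d_{ij}|}P_{|d_{ij}|}}{\prod_{i \in D^+} {}^{(\bar{c}_i(w)-1)+|d_i|}P_{|d_i|}} \cdot \dfrac{\prod_{i \in D^-} {}^{(\bar{c}_i(w)-1)}P_{|d_i|}}{\prod_{i=1}^{N} \prod_{j \in D_i^-} {}^{(\bar{c}_{ij}(w)-1)}P_{|d_{ij}|}}$ (5)


where ${}^{n}P_k := \frac{n!}{(n-k)!}$ is the number of permutations of $k > 0$ items from $n > 0$
objects, for $k \le n$, and we use the convention that for $S = \emptyset$, $\prod_{s \in S} \ldots = 1$.
Below, in Lemma 1, we establish that $E_\theta(\xi(M) \mid w) = H(w)$, and present an
efficient incremental scheme to compute $E_\theta(\xi(M) \mid \vec{x}ab)$ from $E_\theta(\xi(M) \mid \vec{x}a)$.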


Lemma 1 (Incremental computation of $E(\cdot \mid \cdot)$). If the following consistency
condition

$\forall i, j \in [1..N] \;.\; \bar{c}_{ij}(\vec{x}a) + d_{ij} > 0$ (6)

is met, then the following holds:

$E_\theta(\xi(M) \mid \vec{x}ab) = H(\vec{x}ab) = H(\vec{x}a) \cdot \dfrac{\bar{c}_{ab}(\vec{x}a) + d_{ab}}{\bar{c}_{ab}(\vec{x}a)} \cdot \dfrac{\bar{c}_a(\vec{x}a)}{\bar{c}_a(\vec{x}a) + d_a}$. (7)


Algorithm 3. BayesExpMonitor

Parameters: Q, φ = Σ_{l=1}^{p} κ_l ξ_l, θ
Output: E

 1: function Init(σ = 1)
 2:   for v_ij ∈ V_φ do
 3:     c̄_ij ← θ_ij
 4:     c̄_i ← Σ_{j∈[1..N]} θ_ij
 5:     m_ij ← min_{l∈[1..p]} d^l_ij
 6:   active ← false
 7:   [illegible in source]
 8:   [illegible in source]

 1: function Next(σ')
 2:   c̄_σ ← c̄_σ + 1
 3:   c̄_σσ' ← c̄_σσ' + 1
 4:   if active = false then
 5:     if (∀ v_ij ∈ V_φ . c̄_ij + m_ij > 0) then
 6:       active ← true
 7:       for l ∈ [1..p] do
 8:         h^l ← H^l({c̄_ij}_{i,j}, {c̄_i}_i)
 9:   else
10:     for l ∈ [1..p] do
11:       h^l ← h^l · (c̄_σσ' − 1 + d^l_σσ') / (c̄_σσ' − 1) · (c̄_σ − 1) / (c̄_σ − 1 + d^l_σ)
12:   if active = true then
13:     E ← Σ_{l=1}^{p} κ_l h^l
14:   σ ← σ'
15:   return E


Condition (6) guarantees that the permutations in (5) are well-defined. The 
first equality in (7) follows from Marchal et al. [51], and the rest uses the conju- 
gacy of the prior. Lemma 1 forms the basis of the efficient update of our Bayesian 
monitor. Observe that on any given path, once (6) holds, it continues to hold for- 
ever. Thus, initially the monitor keeps updating H internally without outputting 
anything. Once (6) holds, it keeps outputting H from then on. 
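The closed form for the posterior expectation of a monomial and its incremental update can be checked on a small instance. The sketch below follows the standard moment formulas of the Dirichlet distribution row by row. The names `perm`, `H`, and `step` and the integer-valued prior are assumptions of this example, and exact rational arithmetic is used so that the closed form and the update can be compared without rounding.

```python
from fractions import Fraction

def perm(n, k):
    """nPk = n! / (n - k)!, written as the falling product n (n-1) ... (n-k+1)."""
    out = Fraction(1)
    for t in range(k):
        out *= n - t
    return out

def H(cbar, d):
    """Closed form; cbar[i][j] = c_ij + theta_ij, d[i][j] = exponent of v_ij."""
    num, den = Fraction(1), Fraction(1)
    for row_c, row_d in zip(cbar, d):
        for c_ij, d_ij in zip(row_c, row_d):
            if d_ij > 0:
                num *= perm(c_ij - 1 + d_ij, d_ij)
            elif d_ij < 0:
                den *= perm(c_ij - 1, -d_ij)
        c_i, d_i = sum(row_c), sum(row_d)
        if d_i > 0:
            den *= perm(c_i - 1 + d_i, d_i)
        elif d_i < 0:
            num *= perm(c_i - 1, -d_i)
    return num / den

def step(h, cbar, d, a, b):
    """Incremental update for a new edge (a, b); cbar holds the counts before it."""
    c_ab, c_a = cbar[a][b], sum(cbar[a])
    d_ab, d_a = d[a][b], sum(d[a])
    return h * Fraction(c_ab + d_ab, c_ab) * Fraction(c_a, c_a + d_a)

# Two states (0-indexed), uniform prior theta = 1, edge (1, 0) seen twice so far.
cbar = [[1, 1], [3, 1]]
d = [[1, 0], [-1, 0]]           # monomial v_00 / v_10
h0 = H(cbar, d)                 # posterior expectation before the new edge
h1 = step(h0, cbar, d, 1, 0)    # incremental update for one more edge (1, 0)
cbar[1][0] += 1
h1_direct = H(cbar, d)          # closed form recomputed from scratch
```

In this instance the expectation moves from 3/4 to 2/3, and the incremental update agrees with the value recomputed from the closed form.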


5.2 Implementation of the Bayesian Monitor 


We present the Bayesian monitor implementation in BayesConfIntMonitor
(Algorithm 4), which invokes BayesExpMonitor (Algorithm 3) as a subroutine.
BayesExpMonitor computes the expected semantic value of an expression $\varphi$ in
polynomial form, by computing the individual expected value of each monomial
using Lemma 1, and combining them using the linearity property. We drop
the arguments from $\bar{c}_i(\cdot)$ and $\bar{c}_{ij}(\cdot)$ and simply write $\bar{c}_i$ and $\bar{c}_{ij}$ as constants
associated to appropriate words. The symbol $m_{ij}$ in Line 5 of Init is used as a
bookkeeping variable for quickly checking the consistency condition (Eq. 6) in Line 5
of Next. In BayesConfIntMonitor, we compute the expected value and the variance
of $\varphi$, by invoking BayesExpMonitor on $\varphi$ and $\varphi^2$ respectively, and then compute
the confidence interval using Chebyshev’s inequality. It can be observed
in the Next subroutines of BayesConfIntMonitor and BayesExpMonitor that a
deterministic transition function suffices for the Bayesian monitors.
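The last step, turning the two posterior expectations into an interval, can be sketched as follows. The function name is hypothetical, and the input values are the first two moments of a Beta(3, 2) distribution, chosen only as a plausible posterior of a single transition probability.

```python
import math

def chebyshev_interval(e1, e2, delta):
    """Interval E +/- sqrt(S^2 / delta) from E = E(phi | x) and E2 = E(phi^2 | x)."""
    variance = max(0.0, e2 - e1 ** 2)   # S^2 = E(phi^2) - E(phi)^2
    # Chebyshev: P(|phi - E| >= k * S) <= 1 / k^2, and 1 / k^2 = delta gives k.
    eps = math.sqrt(variance / delta)
    return e1 - eps, e1 + eps

# Beta(3, 2): mean 3/5 and second moment (3 * 4) / (5 * 6) = 2/5.
low, high = chebyshev_interval(0.6, 0.4, delta=0.05)
```

Chebyshev’s inequality makes no assumption on the shape of the posterior, which is why the resulting interval can be much wider than a credible interval computed from the exact posterior.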


Theorem 4 (Correctness). Let $(Q, \varphi, \delta)$ be a problem instance, and $p_\theta$ be
given as the prior distribution which satisfies Assumption 1. Algorithm 4
produces a monitor for $(Q, \varphi, \delta)$ that solves Problem 2.


Theorem 5 (Computational resources). Let $(Q, \varphi, \delta)$ be a problem instance
and $\mathcal{A}$ be the monitor computed using the BayesConfIntMonitor routine of
Algorithm 4. Suppose the size of $\varphi$ is $n$. The monitor $\mathcal{A}$ requires $O(n^2 2^n)$
registers, and takes $O(n^2 2^n)$ time to update its output after receiving a new input
symbol. For the special case of $\varphi$ being in polynomial form, $\mathcal{A}$ requires only $O(n^2)$
registers, and takes only $O(n^2)$ time to update its output after receiving a new
input symbol.


Algorithm 4. BayesConfIntMonitor

Parameters: Q, φ, θ
Output: Λ

 1: function Init(σ = 1)
 2:   φ ← polyn. form of φ; φ² ← polyn. form of φ²
 3:   EXP ← BayesExpMonitor(Q, φ, θ)
 4:   EXP2 ← BayesExpMonitor(Q, φ², θ)
 5:   EXP.Init(σ)
 6:   EXP2.Init(σ)

 1: function Next(σ')
 2:   E ← EXP.Next(σ')
 3:   E2 ← EXP2.Next(σ')
 4:   if E ≠ ⊥ and E2 ≠ ⊥ then
 5:     S ← E2 − E²
 6:     Λ ← [E ± sqrt(S/δ)]
 7:   return Λ


A bound on the convergence speed of the Bayesian monitor is left open. This 
would require a bound on the change in variance with respect to the length 
of the observed path, which is not known for the general case of PSEs. Note 
that the efficient (quadratic) cases are different for the frequentist and Bayesian 
monitors, suggesting the use of different monitors for different specifications. 


6 Experiments 


We implemented our frequentist and Bayesian monitors in a tool written in Rust, 
and used the tool to design monitors for the lending and the college admission 
examples taken from the literature [48,54] (described in Sect. 1.1). The gener- 
ators are modeled as Markov chains (see Fig. 1)—unknown to the monitors— 
capturing the sequential interactions between the decision-makers (i.e., the bank 
or the college) and their respective environments (i.e., the loan applicants or the 
students), as described by D’Amour et al. [16]. The setup of the experiments is as 
follows: We created a multi-threaded wrapper program, where one thread simu- 
lates one long run of the Markov chain, and a different thread executes the moni- 
tor. Every time a new state is visited by the Markov chain on the first thread, the 
information gets transmitted to the monitor on the second thread, which then 
updates the output. The experiments were run on a 2017 MacBook Pro equipped 
with a 2.3 GHz dual-core Intel Core i5 processor and 8 GB RAM. The tool can 
be downloaded from the following URL, where we have also included the scripts to 
reproduce our experiments: https://github.com/ista-fairness-monitoring/fmlib. 
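
The two-thread setup described above can be sketched as follows. This is a simplified illustration in Python, not the Rust tool itself: the two-state chain, the names `simulate`, `TransitionMonitor`, and `run`, and the choice of monitoring a single transition probability with a Hoeffding-style interval are all assumptions made for the example.

```python
import math
import queue
import random
import threading

# Hypothetical two-state chain; the monitor never reads these probabilities.
TRANSITIONS = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.4, 1: 0.6}}
DONE = None  # sentinel that marks the end of the simulated run


def simulate(steps, out, seed=0):
    """Generator thread: walk the chain and send every visited state."""
    rng = random.Random(seed)
    state = 0
    out.put(state)
    for _ in range(steps):
        succ = TRANSITIONS[state]
        state = rng.choices(list(succ), weights=list(succ.values()))[0]
        out.put(state)
    out.put(DONE)


class TransitionMonitor:
    """Estimates P(dst | src) from counts, with a Hoeffding-style interval.

    Assumption: the observed transitions out of `src` are treated like
    independent Bernoulli samples; this is a simplification for illustration.
    """

    def __init__(self, src, dst, delta=0.05):
        self.src, self.dst, self.delta = src, dst, delta
        self.visits = 0  # register: transitions observed out of src
        self.hits = 0    # register: those that went to dst
        self.prev = None

    def update(self, state):
        if self.prev == self.src:
            self.visits += 1
            self.hits += int(state == self.dst)
        self.prev = state
        return self.output()

    def output(self):
        if self.visits == 0:
            return (0.0, 1.0)
        mean = self.hits / self.visits
        eps = math.sqrt(math.log(2 / self.delta) / (2 * self.visits))
        return (max(0.0, mean - eps), min(1.0, mean + eps))


def run(steps=20000):
    channel = queue.Queue()
    monitor = TransitionMonitor(src=0, dst=1)
    result = {}

    def consume():
        # Monitor thread: update the output after every received state.
        while (state := channel.get()) is not DONE:
            result["interval"] = monitor.update(state)

    threads = [threading.Thread(target=simulate, args=(steps, channel)),
               threading.Thread(target=consume)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return monitor, result["interval"]
```

Calling `run()` returns the monitor together with its last interval. The monitor keeps only two counters, which mirrors the small register counts reported for the actual tool, although the real monitors handle richer expressions than a single transition probability.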

We summarize the experimental results in Fig. 3, and, from the table, observe 
that both monitors are extremely lightweight: they take less than a millisecond 
per update and small numbers of registers to operate. From the plots, we observe 
that the frequentist monitors’ outputs are always centered around the ground 
truth values of the properties, empirically showing that they are always 
objectively correct. On the other hand, the Bayesian monitors’ outputs can vary 
drastically for different choices of the prior, empirically showing that the 
correctness of outputs is subjective. At first glance, the outputs of the Bayesian 
monitors may misleadingly appear wrong, as they often do not contain the ground 
truth values. We reiterate that from the Bayesian perspective, the ground truth 
does not exist. 
Instead, we only have a probability distribution over the true values that gets 
updated after observing the generated sequence of events. The choice of the type 
of monitor ultimately depends on the application requirements. 


[Figure 3 plots: three panels (left, middle, right), each showing the monitors’ 
output intervals over time (0 to 100K steps). Legend: Bayesian (uniform prior), 
Bayesian (arbitrary prior), Frequentist.] 

Scenario                   | Size of expression | Av. comp. time/step (Freq.) | Av. comp. time/step (Bayes.) | # registers (Freq.) | # registers (Bayes.) 
Lending (bias) + dem. par. | 1                  | 13.0 µs                     | 29.3 µs                      | 15                  | 17 
Lending (fair) + eq. opp.  | 5                  | 21.6 µs                     | 31.0 µs                      | 29                  | 27 
Admission + soc. burden    | 19                 | 53.8 µs                     | 184.6 µs                     | 46                  | 102 

Fig. 3. The plots show the 95% confidence intervals estimated by the monitors over 
time, averaged over 10 different sample paths, for the lending with demographic parity 
(left), lending with equalized opportunity (middle), and the college admission with 
social burden (right) problems. The horizontal dotted lines are the ground truth values 
of the properties, obtained by analyzing the Markov chains used to model the systems 
(unknown to the monitors). The table summarizes various performance metrics. 


7 Conclusion 


We showed how to monitor algorithmic fairness properties on a Markov chain 
with unknown transition probabilities. Two separate algorithms are presented, 
using the frequentist and the Bayesian approaches to statistics. The perfor- 
mances of both approaches are demonstrated, both theoretically and empirically. 

Several future directions exist. Firstly, more expressive classes of properties 
need to be investigated to cover a broader range of algorithmic fairness criteria. 
We believe that boolean logical connectives, as well as min and max operators 
can be incorporated straightforwardly using ideas from the related literature [3]. 
This also adds support for absolute values, since |x| = max{x, -x}. On the other 
hand, properties that require estimating how often a state is visited would require 
more information about the dynamics of the Markov chain, including its mixing 
time. Monitoring statistical hyperproperties [18] is another important direction, 
which will allow us to encode individual fairness properties [21]. Secondly, more 
liberal assumptions on the system model will be crucial for certain practical 
applications. In particular, hidden Markov models, time-inhomogeneous Markov 
models, Markov decision processes, etc., are examples of system models with 
widespread use in real-world applications. Finally, better error bounds tailored 
for specific algorithmic fairness properties can be developed through a deeper 
mathematical analysis of the underlying statistics, which will sharpen the con- 
servative bounds obtained through off-the-shelf concentration inequalities. 
Abstract. A rigorous formalization of desired system requirements is 
indispensable when performing any verification task. This often limits 
the application of verification techniques, as writing formal specifications 
is an error-prone and time-consuming manual task. To facilitate this, 
we present nl2spec, a framework for applying Large Language Models 
(LLMs) to derive formal specifications (in temporal logics) from unstruc- 
tured natural language. In particular, we introduce a new methodology 
to detect and resolve the inherent ambiguity of system requirements in 
natural language: we utilize LLMs to map subformulas of the formaliza- 
tion back to the corresponding natural language fragments of the input. 
Users iteratively add, delete, and edit these sub-translations to amend 
erroneous formalizations, which is easier than manually redrafting the 
entire formalization. The framework is agnostic to specific application 
domains and can be extended to similar specification languages and new 
neural models. We perform a user study to obtain a challenging dataset, 
which we use to run experiments on the quality of translations. We pro- 
vide an open-source implementation, including a web-based frontend. 


1 Introduction 


A rigorous formalization of desired system requirements is indispensable when 
performing any verification-related task, such as model checking [7], synthesis [6], 
or runtime verification [20]. Writing formal specifications, however, is an error- 
prone and time-consuming manual task typically reserved for experts in the field. 
This paper presents nl2spec, a framework, accompanied by a web-based tool, 
to facilitate and automate writing formal specifications (in LTL [34] and similar 
temporal logics). The core contribution is a new methodology to decompose 
the natural language input into sub-translations by utilizing Large Language 
Models (LLMs). The nl2spec framework provides an interface to interactively 

[Figure 1 (screenshot). Prompt: “Globally, grant 0 and grant 1 do not hold at the 
same time until it is allowed.” Settings: model codex, prompt generic, number of 
tries 3, temperature 0.20. Sub-translations with confidence scores: “and” to & 
(33.33%), “until” to U (33.33%), “it is allowed” to a (33.33%), “do not hold at 
the same time” to !(g0 & g1) (66.67%), “globally” to G (66.67%), “grant0” to g0 
(100%), “grant1” to g1 (100%). Final result: G((!((g0 & g1)) U a)) (100.0%).] 

Fig. 1. A screenshot of the web-interface for nl2spec. 


add, edit, and delete these sub-translations instead of attempting to grapple with 
the entire formalization at once (a feature that is sorely missing in similar work, 
e.g., [13,30]). 

Figure 1 shows the web-based frontend of nl2spec. As an example, we con- 
sider the following system requirement given in natural language: “Globally, 
grant 0 and grant 1 do not hold at the same time until it is allowed”. The tool 
automatically translates the natural language specification correctly into the 
LTL formula G((!((g0 & g1)) U a)). Additionally, the tool generates 
sub-translations, such as the pair (“do not hold at the same time”, !(g0 & g1)), 
which help in verifying the correctness of the translation. 

Consider, however, the following ambiguous example: “a holds until b holds 
or always a holds”. Human supervision is needed to resolve the ambiguity on 
the operator precedence. This can be easily achieved with nl2spec by adding or 
editing a sub-translation using explicit parentheses (see Sect. 4 for more details 
and examples). To capture such (and other types of) ambiguity in a benchmark 
data set, we conducted an expert user study specifically asking for challenging 
translations of natural language sentences to LTL formulas. 

The key insight in the design of nl2spec is that the process of translation 
can be decomposed into many sub-translations automatically via LLMs, and 
the decomposition into sub-translations allows users to easily resolve ambigu- 
ous natural language and erroneous translations through interactively modifying 
sub-translations. The central goal of nl2spec is to keep the human supervision 
minimal and efficient. To this end, all translations are accompanied by a 
confidence score. Alternative suggestions for sub-translations can be chosen via a 
drop-down menu and misleading sub-translations can be deleted before the next 
loop of the translation. We evaluate the end-to-end translation accuracy of our 
proposed methodology on the benchmark data set obtained from our expert 
user study. Note that nl2spec can be applied to the user’s respective appli- 
cation domain to increase the quality of translation. As proof of concept, we 
provide additional examples, including an example for STL [31] in the GitHub 
repository¹. 

nl2spec is agnostic to machine learning models and specific application 
domains. We will discuss possible parameterizations and inputs of the tool in 
Sect. 3. We discuss our sub-translation methodology in more detail in Sect. 3.2 
and introduce an interactive few-shot prompting scheme for LLMs to generate 
them. We evaluate the effectiveness of the tool to resolve erroneous 
formalizations in Sect. 4 on a data set obtained from conducting an expert user study. 
We discuss limitations of the framework and conclude in Sect. 5. For additional 
details, please refer to the complete version [8]. 


2 Background and Related Work 


2.1 Natural Language to Linear-Time Temporal Logic 


Linear-time Temporal Logic (LTL) [34] is a temporal logic that forms the basis 
of many practical specification languages, such as the IEEE property specifica- 
tion language (PSL) [22], Signal Temporal Logic (STL) [31], or System Verilog 
Assertions (SVA) [43]. By focusing on the prototype temporal logic LTL, we 
keep the nl2spec framework extendable to specification languages in specific 
application domains. LTL extends propositional logic with temporal modalities 
U (until) and X (next). There are several derived operators, such as 
$F\varphi = \mathit{true}\ U\ \varphi$ and $G\varphi = \neg F \neg\varphi$. $F\varphi$ states that 
$\varphi$ will eventually hold in the future and $G\varphi$ states that $\varphi$ holds 
globally. Operators can be nested: $GF\varphi$, for example, states that $\varphi$ has to 
occur infinitely often. LTL specifications describe a system's behavior and its 
interaction with an environment over time. For example, given a process 0 and 
a process 1 and a shared resource, the formula 
$G(r_0 \rightarrow F g_0) \wedge G(r_1 \rightarrow F g_1) \wedge G\neg(g_0 \wedge g_1)$ describes that 
whenever a process requests ($r_i$) access to a shared resource it will eventually 
be granted ($g_i$). The subformula $G\neg(g_0 \wedge g_1)$ ensures that grants given 
are mutually exclusive. 

Early work in translating natural language to temporal logics focused on 
grammar-based approaches that could handle structured natural language [17, 
24]. A survey of earlier research before the advent of deep learning is provided 
in [4]. Other approaches include an interactive method using SMT solving and 
semantic parsing [15], or structured temporal aspects in grounded robotics [45] 
and planning [32]. Neural networks have only recently been used to translate 


1 The tool is available at GitHub: https://github.com/realChrisHahn2/nl2spec. 
into temporal logics, e.g., by training a model for STL from scratch [21], 
fine-tuning language models [19], or an approach to apply GPT-3 [13,30] in a 
one-shot fashion, where [13] output a restricted set of declare templates [33] that 
can be translated to a fragment of LTLf [10]. Translating natural language to 
LTL has especially been of interest to the robotics community (see [16] for an 
overview), where datasets and application domains are, in contrast to our setting, 
based on structured natural language. Independent of relying on structured data, 
all previous tools lack detection and interactive resolution of the inherent 
ambiguity of natural language, which is the main contribution of our framework. 
Related to our approach is recent work [26], where generated code is iteratively 
refined to match desired outcomes based on human feedback. 


2.2 Large Language Models 


LLMs are large neural networks typically consisting of up to 176 billion parame- 
ters. They are pre-trained on massive amounts of data, such as “The Pile” [14]. 
Examples of LLMs include the GPT [36] and BERT [11] model families, open- 
source models, such as T5 [38] and Bloom [39], or commercial models, such as 
Codex [5]. LLMs are Transformers [42], which is the state-of-the-art neural 
architecture for natural language processing. Additionally, Transformers have shown 
remarkable performance when being applied to classical problems in verification 
(e.g., [9,18,25,40]), reasoning (e.g., [28,50]), as well as the auto-formalization [35] 
of mathematics and formal specifications (e.g., [19,21,49]). 

In language modelling, we model the probability of a sequence of tokens in a 
text [41]. The joint probability of tokens in a text is generally expressed as [39]: 


$$p(x) = p(x_1, \ldots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_{<t}),$$ 

where $x$ is the sequence of tokens, $x_t$ represents the $t$-th token, and $x_{<t}$ is the 
sequence of tokens preceding $x_t$. We refer to this as an autoregressive language 
model that iteratively predicts the probability of the next token. Neural network 
approaches to language modelling have superseded classical approaches, such as 
n-grams [41]. Especially Transformers [42] were shown to be the most effective 
architecture at the time of writing [1,23,36]. 
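
To make the factorization concrete, the following minimal sketch computes the joint probability of a token sequence as a sum of conditional log-probabilities. The toy table `BIGRAM` and the functions `next_token_probs` and `sequence_log_prob` are hypothetical and only stand in for a real language model, which would condition on the entire prefix rather than on the last token alone.

```python
import math

# Hypothetical toy model: p(next token | last token), with "<s>" as start symbol.
BIGRAM = {
    "<s>": {"G": 0.6, "F": 0.4},
    "G": {"a": 0.5, "F": 0.5},
    "F": {"a": 1.0},
    "a": {"</s>": 1.0},
}


def next_token_probs(prefix):
    """Return p(x_t | x_<t); this toy model only looks at the last token."""
    last = prefix[-1] if prefix else "<s>"
    return BIGRAM[last]


def sequence_log_prob(tokens):
    """Sum log p(x_t | x_<t) over all positions t (chain-rule factorization)."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += math.log(next_token_probs(tokens[:t])[token])
    return total
```

For the sequence `["G", "F", "a", "</s>"]` the product of the conditionals is 0.6 · 0.5 · 1.0 · 1.0, so exponentiating `sequence_log_prob` recovers the joint probability.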

While fine-tuning neural models on a specific translation task remains a valid 
approach, also showing initial success in generalizing to unstructured natural 
language when translating to LTL [19], a common technique to obtain high 
performance with a limited amount of labeled data is so-called “few-shot prompting” [3]. 
The language model is presented with a natural language description of the task, 
usually accompanied by a few examples that demonstrate the input-output 
behavior. The framework presented in this paper relies on this technique. We describe 
the proposed few-shot prompting scheme in detail in Sect. 3.2. 

Currently implemented in the framework and used in the expert-user study 
are Codex and Bloom, which showed the best performance during testing. 
Codex and GPT-3.5-turbo. Codex [5] is a GPT-3 variant that was initially of 
up to 12B parameters in size and fine-tuned on code. The initial version of 
GPT-3 itself was trained on variations of Common Crawl,² Webtext-2 [37], two 
internet-based book corpora and Wikipedia [3]. The fine-tuning dataset for the 
vanilla version of Codex was collected in May 2020 from 54 million public software 
repositories hosted on GitHub, using 159 GB of training data for fine-tuning. For 
our experiments, we used the commercial 2022 version of code-davinci-002, 
which is likely larger (in the 176B range³) than the vanilla Codex models. 
GPT-3.5-turbo is the currently available follow-up model of GPT-3. 


Bloom. Bloom [39] is an open-source LLM family available in different sizes of 
up to 176B parameters trained on 46 natural languages and 13 programming 
languages. It was trained on the ROOTS corpus [27], a collection of 498 hugging- 
face [29,48] datasets consisting of 1.61 terabytes of text. For our experiments, 
we used the 176B version running on the huggingface inference API⁴. 


3 The nl2spec Framework 


3.1 Overview 


The framework follows a standard frontend-backend implementation. Figure 2 
shows an overview of the implementation of nl2spec. Parts of the framework 
that can be extended for further research or usage in practice are highlighted. The 
framework is implemented in Python 3 and flask [44], a lightweight WSGI web 
application framework. For the experiments in this paper, we use the OpenAI 
library and huggingface (transformer) library [47]. We parse the LTL output 
formulas with a standard LTL parser [12]. The tool can either be run as a 
command line tool, or with the web-based frontend. 

The frontend handles the interaction with a human-in-the-loop. The inter- 
face is structured in three views: the “Prompt”, “Sub-translations”, and “Final 
Result” view (see Fig. 1). The tool takes a natural language sentence, optional 
sub-translations, the model temperature, and number of runs as input. It pro- 
vides sub-translations, a confidence score, alternative sub-translations and the 
final formalization as output. The frontend then allows for interactively select- 
ing, editing, deleting, or adding sub-translations. The backend implements the 
handling of the underlying neural models, the generation of the prompt, and 
the ambiguity resolving, i.e., computing the confidence score including alter- 
native sub-translations and the interactive few-shot prompting algorithm (cf. 
Sect. 3.2). The framework is designed to have an easy interface to implement 
new models and write domain-specific prompts. The prompt is a .txt file that 
can be adjusted to specific domains to increase the quality of translations. To 
apply the sub-translation refinement methodology, however, the prompt needs to 
follow our interactive prompting scheme, which we introduce in the next section. 


2 https://commoncrawl.org/. 
3 https://blog.eleuther.ai/gpt3-model-sizes/. 
4 https://huggingface.co/inference-api. 
Fig. 2. Overview of the nl2spec framework with a human-in-the-loop: highlighted 
areas indicate parts of the framework that are effortlessly extendable. 


3.2 Interactive Few-Shot Prompting 


The core of the methodology is the decomposition of the natural language input 
into sub-translations. We introduce an interactive prompting scheme that gener- 
ates sub-translations using the underlying neural model and leverages the sub- 
translations to produce the final translation. Algorithm 1 depicts a high-level 
overview of the interactive loop. The main idea is to give a human-in-the-loop 
the options to add, edit, or delete sub-translations and feed them back into 
the language models as “Given translations” in the prompt (see Fig. 3). After 
querying a language model M with this prompt F, model specific parameters P 
and the interactive prompt that is computed in the loop, the model generates 
a natural language explanation, a dictionary of sub-translations, and the final 
translation. Notably, the model M can be queried multiple times as specified 
by the number of runs r, thereby generating multiple possible sub-translations. 
The confidence score of each sub-translation is computed as votes over multiple 
queries and by default the sub-translation with the highest confidence score is 
selected to be used as a given sub-translation in the next iteration. In the fron- 
tend, the user may view and select alternative generated sub-translations for 
each sub-translation via a drop-down menu (see Fig. 1). 
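
The interactive loop and the vote-based confidence score can be sketched as follows. This is an illustrative reading of Algorithm 1, not the tool's actual code: the names `confidence_scores`, `interactive_loop`, `query_model`, and `user_interaction` are hypothetical, the model-specific parameters are reduced to the number of runs, and both the score (votes divided by the number of runs) and the majority choice of the final formula are assumptions.

```python
from collections import Counter, defaultdict


def confidence_scores(runs):
    """Count votes for each sub-translation over several model queries.

    runs: list of dicts {natural-language fragment: formula}, one per query.
    Returns {fragment: [(formula, score), ...]}, best alternative first.
    """
    votes = defaultdict(Counter)
    for sub_translations in runs:
        for fragment, formula in sub_translations.items():
            votes[fragment][formula] += 1
    n = len(runs)
    return {fragment: [(formula, count / n) for formula, count in counter.most_common()]
            for fragment, counter in votes.items()}


def interactive_loop(sentence, query_model, user_interaction, num_runs=3):
    """Repeat query and user feedback until the user approves the formula.

    query_model(sentence, given) -> (sub_translations, formula)
    user_interaction(formula, proposed, scores) -> (approved, given)
    """
    given = {}
    while True:
        outputs = [query_model(sentence, given) for _ in range(num_runs)]
        scores = confidence_scores([subs for subs, _ in outputs])
        # Assumption: report the formula returned most often across the runs.
        formula = Counter(f for _, f in outputs).most_common(1)[0][0]
        # By default, the top-scoring alternative of each fragment is proposed.
        proposed = {fragment: alts[0][0] for fragment, alts in scores.items()}
        approved, given = user_interaction(formula, proposed, scores)
        if approved:
            return formula
```

With three runs, a sub-translation returned by two of the three queries receives a score of 2/3, which matches the 66.67% entries visible in Fig. 1.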

Figure 3 shows a generic prompt that illustrates our methodology. The 
prompting scheme consists of three parts: the specification-language-specific 
part (lines 1-4), the few-shot examples (lines 5-19), and the interactive prompt 
minimal.txt 

 1 Translate the following natural language sentences into an LTL formula and explain your 
 2 translation step by step. Remember that X means "next", U means "until", G means 
 3 "globally", F means "finally", which means GF means "infinitely often". The formula 
 4 should only contain atomic propositions or operators &, ~, ->, <->, X, U, G, F. 
 5 Natural Language: Globally if a holds then c is true until b. Given translations: {} 
 6 Explanation: "a holds" from the input translates to the atomic proposition a. 
 7 "c is true until b" from the input translates to the subformula c U b. "if x then y" 
 8 translates to an implication x -> y, so "if a holds then c is true until b" translates 
 9 to an implication a -> c U b. "Globally" from the input translates to the temporal 
10 operator G. Explanation dictionary: {"a holds" : "a", "c is true until b" : "c U b", 
11 "if a holds then c is true until b" : "a -> c U b", "Globally" : "G"} So the final 
12 LTL translation is G a -> c U b.FINISH Natural Language: Every request r is 
13 eventually followed by a grant g. Given translations: {} Explanation: "Request r" 
14 from the input translates to the atomic proposition r and "grant g" translates to the 
15 atomic proposition g. "every" means at every point in time, i.e., globally, "never" 
16 means at no point in time, and "eventually" translates to the temporal operator F. 
17 "followed by" is the natural language representation of an implication. Explanation 
18 dictionary: {"Request r" : "r", "grant g" : "g", "every" : "G", "eventually": "F", 
19 "followed by" : "->"} So the final LTL translation is G r -> F g.FINISH 


Fig. 3. Prompt with minimal domain knowledge of LTL. 


including the natural language and sub-translation inputs (not displayed, given 
as input). The specification language specific part leverages “chain-of-thought” 
prompt-engineering to elicit reasoning from large language models [46]. The key 
of nl2spec, however, is the setup of the few-shot examples. This minimal prompt 
consists of two few-shot examples (lines 5-12 and 12-19). The end of an exam- 
ple is indicated by the “FINISH” token, which is the stop token for the machine 
learning models. A few-shot example in nl2spec consists of the natural language 
input (line 5), a dictionary of given translations, i.e., the sub-translations (line 
5), an explanation of the translation in natural language (lines 6-10), an 
explanation dictionary summarizing the sub-translations, and finally, the final LTL 
formula. 

This prompting scheme elicits sub-translations from the model, which serve 
as a fine-grained explanation of the formalization. Note that sub-translations 
provided in the prompt are neither unique nor exhaustive, but provide the con- 
text for the language model to generate the correct formalization. 
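
Because every few-shot example ends with an explanation dictionary, the final formula, and the “FINISH” token, a model completion in the same format can be split back into its parts. The sketch below shows one way to do this with regular expressions; the function name `parse_completion` and the exact patterns are assumptions for illustration and not the parsing code of the tool.

```python
import re


def parse_completion(text):
    """Split a completion into (sub_translations, final_formula).

    Assumes the completion follows the format of the few-shot examples:
    '... Explanation dictionary: {...} So the final LTL translation is <formula>.FINISH'
    """
    # Everything after the stop token belongs to the next example.
    text = text.split("FINISH", 1)[0]

    sub_translations = {}
    dict_match = re.search(r"Explanation dictionary:\s*\{(.*?)\}", text, re.S)
    if dict_match:
        pairs = re.findall(r'"([^"]*)"\s*:\s*"([^"]*)"', dict_match.group(1))
        sub_translations = dict(pairs)

    formula = None
    final_match = re.search(r"final LTL translation is\s*(.*)", text, re.S)
    if final_match:
        # Drop surrounding whitespace and the sentence-final period.
        formula = final_match.group(1).strip().rstrip(".")
    return sub_translations, formula
```

Applied to the first few-shot example of Fig. 3, this yields four sub-translations and the formula `G a -> c U b`.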


4 Evaluation 


In this section, we evaluate our framework and prompting methodology on a data 
set obtained by conducting an expert user study. To show the general applica- 
bility of this framework, we use the minimal prompt that includes only minimal 
domain knowledge of the specification language (see Fig. 3). This prompt was 
intentionally written before conducting the expert user study. We limited 
the few-shot examples to two and deliberately provided no few-shot example that 
includes “given translations”. We use the minimal prompt to focus the evaluation 
on the effectiveness of our interactive sub-translation refinement methodology in 
Algorithm 1: Interactive Few-shot Prompting Algorithm 
Input: Natural language S, Few-shot prompt F, set of given sub-translations 
    (s, $\varphi$), and language model M 
Interactions: set of sub-translations (s, $\varphi$), confidence scores C 
Set of Model specific parameter P: e.g., model-temperature t, number of 
    runs r 
Output: LTL formula $\psi$ that formalizes S 
    $\psi$, (s, $\varphi$), C = empty 
    while user not approves LTL formula $\psi$ do 
        interactive_prompt = compute_prompt(S, F, (s, $\varphi$)) 
        $\psi$, (s, $\varphi$), C = query(M, P, interactive_prompt) 
        (s, $\varphi$) = user_interaction((s, $\varphi$), C) 
    end while 
    return $\psi$ 


resolving ambiguity and fixing erroneous translations. In practice, one would like 
to replace this minimal prompt with domain-specific examples that capture the 
underlying distribution as closely as possible. As a proof of concept, we elaborate 
on this in the full version [8]. 


4.1 Study Setup 


To obtain a benchmark dataset of unstructured natural language and their for- 
malizations into LTL, we asked five experts in the field to provide examples that 
the experts thought are challenging for a neural translation approach. Unlike 
existing datasets that follow strict grammatical and syntactical structure, we 
imposed no such restrictions on the study participants. Each natural language 
specification was restricted to one sentence and to five atomic propositions 
a, b, c, d, e. Note that nl2spec is not restricted to a specific set of atomic 
propositions (cf. Fig. 1). Which variable scheme to use can be specified as an initial 
sub-translation. We elaborate on this in the full version [8]. To ensure unique 
instances, the experts worked in a shared document, resulting in 36 benchmark 
instances. We provide three randomly drawn examples for the interested reader: 


natural language S                                                         | LTL specification $\psi$ 

If b holds then, in the next step, c holds until a holds or always c holds | b -> X ((c U a) || G c) 
If b holds at some point, a has to hold somewhere beforehand               | (F b) -> (!b U (a & !b)) 
One of the following aps will hold at all instances: a,b,c                 | G (a | b | c) 


The poor performance of existing methods (cf. Table 1) exemplifies the 
difficulty of this data set. 


4.2 Results 


We evaluated our approach using the minimal prompt (if not otherwise stated), 
with number of runs set to three and with a temperature of 0.2. 
Quality of Initial Translation. We analyze the quality of initial translations, i.e., 
translations obtained before any human interaction. This experiment demon- 
strates that the initial translations are of high quality, which is important to 
ensure an efficient workflow. We compared our approach to fine-tuning language 
models on structured data [19] and to an approach using GPT-3 or Rasa [2] to 
translate natural language into a restricted set of declare patterns [13] (which 
could not handle most of the instances in the benchmark data set, even when 
replacing the atomic propositions with their used entities). The results of 
evaluating the accuracy of the initial translations on our benchmark expert set are 
shown in Table 1. 

At the time of writing, using Codex in the backend outperforms GPT-3.5- 
turbo and Bloom on this task, by correctly translating 44.4% of the instances 
using the minimal prompt. We only count an instance as correctly translated 
if it matches the intended meaning of the expert; no alternative translation 
of ambiguous input was accepted. In addition to the experiments using the 
minimal prompt, we conducted experiments on an augmented prompt with 
in-distribution examples after the user study was conducted by randomly drawing 
four examples from the expert data set (3 of the examples had not been solved 
before, see the GitHub repository or full version for more details). With this 
in-distribution prompt (ID), the tool translates 21 instances (with the four drawn 
examples remaining in the set), i.e., 58.3%, correctly. 

This experiment shows 1) that the initial translation quality is high and 
can handle unstructured natural language better than previous approaches and 
2) that drawing the few-shot examples in distribution only slightly increased 
translation quality for this data set, which makes the key contributions of nl2spec, 
i.e., ambiguity detection and effortless debugging of erroneous formalizations, 
valuable. Since nl2spec is agnostic to the underlying machine learning models, 
we expect an even better performance in the future with more fine-tuned models. 


Teacher-Student Experiment. In this experiment, we generate an initial set of 
sub-translations with Codex as the underlying neural model. We then ran the 
tool with Bloom as a backend, taking these sub-translations as input. There were 
11 instances that Codex could solve initially that Bloom was unable to solve. On 
these instances, Bloom was able to solve 4 more instances, i.e., 36.4% with sub- 
translations provided by Codex. The four instances that Bloom was able to solve 


Table 1. Translation accuracy on the benchmark data set, where B stands for Bloom, 
C stands for Codex, and G stands for GPT-3.5-Turbo. 


nl2ltl [13]  | T-5 [19]    | nl2spec+B    | nl2spec+C     | nl2spec+C     | nl2spec+C 
rasa         | fine-tuned  | initial      | initial       | initial+ID    | interactive 
1/36 (2.7%)  | 2/36 (5.5%) | 5/36 (13.8%) | 16/36 (44.4%) | 21/36 (58.3%) | 31/36 (86.1%) 
-            | -           | -            | nl2spec+G     | nl2spec+G     | nl2spec+G 
             |             |              | initial       | initial+ID    | interactive 
-            | -           | -            | 12/36 (33.3%) | 17/36 (47.2%) | 21/36 (58.3%) 
with the help of Codex were: “It is never the case that a and b hold at the same 
time.”, “Whenever a is enabled, b is enabled three steps later.”, “If it is the case 
that every a is eventually followed by a b, then c needs to holds infinitely often.”, 
and “One of the following aps will hold at all instances: a,b,c”. This demonstrates 
that our sub-translation methodology is a valid approach: improving the quality 
of the sub-translations indeed has a positive effect on the quality of the final 
formalization. This even holds true when using underperforming neural network 
models. Note that no supervision by a human was needed in this experiment to 
improve the formalization quality. 


Ambiguity Detection. Out of the 36 instances in the benchmark set, at least 9 of 
the instances contain ambiguous natural language. We especially observed two 
classes of ambiguity: 1) ambiguity due to the limits of natural language, e.g., 
operator precedence, and 2) ambiguity in the semantics of natural language; 
nl2spec can help in resolving both types of ambiguity. Details for the following 
examples can be found in the full version [8]. 

An example for the first type of ambiguity from our dataset is the example 
mentioned in the introduction: “a holds until b holds or always a holds”, which 
the expert translated into (a U b) | G a. Running the tool, however, 
translated this example into (a U (b | G(a))). By editing the sub-translation of 
“a holds until b holds” to (a U b) through adding explicit parentheses, the tool 
translates as intended. An example for the second type of ambiguity is the follow- 
ing instance from our data set: “Whenever a holds, b must hold in the next two 
steps.” The intended meaning of the expert was G (a -> (b | X b)), whereas 
the tool translated this sentence into G((a -> X(X(b)))). After changing the 
sub-translation of “b must hold in the next two steps” to b | X b, the tool 
translates the input as intended. 


Fixing Erroneous Translation. With the inherent ambiguity of natural 
language and the unstructured nature of the input, the tool’s translation cannot 
be expected to always be correct on the first try. Verifying and debugging 
sub-translations, however, is significantly easier than redrafting the complete 
formula from scratch. Twenty instances of the data set were not correctly 
translated in an initial attempt using Codex and the minimal prompt in the backend 
(see Table 1). We were able to extract correct translations for 15 instances by 
performing at most three translation loops (i.e., adding, editing, and removing 
sub-translations). We obtained correct results after 1.86 translation loops on 
average. For example, consider the instance “whenever a holds, 
b holds as well”, which the tool mistakenly translated to G(a & b). By fixing 
the sub-translation “b holds as well” to the formula fragment -> b, the sentence 
is translated as intended. Only the remaining five instances, which contain highly 
complex natural language requirements, such as “once a happened, b won’t 
happen again”, needed to be translated by hand. 

In total, we correctly translated 31 out of 36 instances, i.e., 86.11% using the 
nl2spec sub-translation methodology by performing only 1.4 translation loops 
on average (see Table 1). 
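
The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below assumes, as one plausible reading, that the average of 1.86 loops refers to the 15 interactively fixed instances and that each of the 16 initially correct instances counts as exactly one loop; under that assumption the overall average comes out close to the reported 1.4.

```python
TOTAL = 36
initially_correct = 16     # Codex with the minimal prompt (Table 1)
fixed_interactively = 15   # corrected within at most three translation loops
avg_loops_fixed = 1.86     # reported average, assumed to cover the 15 fixed instances

solved = initially_correct + fixed_interactively
accuracy = 100 * solved / TOTAL

# Assumption: an initially correct instance needs exactly one translation loop.
avg_loops_all = (initially_correct * 1 + fixed_interactively * avg_loops_fixed) / solved

print(f"{solved}/{TOTAL} = {accuracy:.2f}%")
print(f"average loops over solved instances: {avg_loops_all:.2f}")
```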




5 Conclusion 


We presented nl2spec, a framework for translating unstructured natural lan- 
guage to temporal logics. A limitation of this approach is its reliance on compu- 
tational resources at inference time. This is a general limitation when applying 
deep learning techniques. Both commercial and open-source models, however, 
provide easily accessible APIs to their models. Additionally, the quality of initial 
translations might be influenced by the amount of training data on logics, code, 
or math that the underlying neural models have seen during pre-training. 

At the core of nl2spec lies a methodology to decompose the natural language 
input into sub-translations, which are mappings of formula fragments to relevant 
parts of the natural language input. We introduced an interactive prompting 
scheme that queries LLMs for sub-translations, and implemented an interface 
for users to interactively add, edit, and delete the sub-translations, which spares 
users from manually redrafting the entire formalization to fix erroneous transla- 
tions. We conducted a user study, showing that nl2spec can be efficiently used 
to interactively formalize unstructured and ambiguous natural language. 


Acknowledgements. We thank OpenAI for providing academic access to Codex and 
Clark Barrett for helpful feedback on an earlier version of the tool. 
Abstract. This manuscript presents the updated version of the Neural 
Network Verification (NNV) tool. NNV is a formal verification software 
tool for deep learning models and cyber-physical systems with neural net- 
work components. NNV was first introduced as a verification framework 
for feedforward and convolutional neural networks, as well as for neural 
network control systems. Since then, numerous works have made signif- 
icant improvements in the verification of new deep learning models, as 
well as tackling some of the scalability issues that may arise when veri- 
fying complex models. In this new version of NNV, we introduce verifica- 
tion support for multiple deep learning models, including neural ordinary 
differential equations, semantic segmentation networks and recurrent neu- 
ral networks, as well as a collection of reachability methods that aim to 
reduce the computation cost of reachability analysis of complex neural net- 
works. We have also added direct support for standard input verification 
formats in the community such as VNNLIB (verification properties), and 
ONNX (neural networks) formats. We present a collection of experiments 
in which NNV verifies safety and robustness properties of feedforward, con- 
volutional, semantic segmentation and recurrent neural networks, as well 
as neural ordinary differential equations and neural network control sys- 
tems. Furthermore, we demonstrate the capabilities of NNV against a com- 
mercially available product in a collection of benchmarks from control sys- 
tems, semantic segmentation, image classification, and time-series data. 


Keywords: neural networks · cyber-physical systems · verification · tool 


1 Introduction 


Deep Learning (DL) models have achieved impressive performance on a wide 
range of tasks, including image classification [13,24,44], natural language pro- 
cessing [15,25], and robotics [47]. Recently, the usage of these models has 
expanded into many other areas, including safety-critical domains, such as 
autonomous vehicles [9, 10,85]. However, deep learning models are opaque sys- 
tems, and it has been demonstrated that their behavior can be unpredictable 
when small changes are applied to their inputs (i.e., adversarial attacks) [67]. 
© The Author(s) 2023 
Therefore, for safety-critical applications, it is often necessary to comprehend 
and analyze the behavior of the whole system, including reasoning about the 
safety guarantees of the system. To address this challenge, many researchers have 
been developing techniques and tools to verify Deep Neural Networks (DNN) 
[4,6,22,39,40,48,55,64,65,77,83,84,86,87], as well as learning-enabled Cyber- 
Physical Systems (CPS) [3,8,12,23,26,34,35,38,50,51]. It is worth noting that 
despite the growing research interest, the verification of deep learning models still 
remains a challenging task, as the complexity and non-linearity of these models 
make them difficult to analyze. Moreover, some verification methods suffer from 
scalability issues, which limits the applicability of some existing techniques to 
large-scale and complex models. Another remaining challenge is the extension of 
existing or new methods for the verification of the extensive collection of layers 
and architectures existing in the DL area, such as Recurrent Neural Networks 
(RNN) [37], Semantic Segmentation Neural Networks (SSNN) [58] or Neural 
Ordinary Differential Equations (neural ODEs) [11]. 

This work contributes to addressing the latter challenge by introducing ver- 
sion 2.0 of NNV¹ (Neural Network Verification)², which is a software tool that 
supports the verification of multiple DL models as well as learning-enabled CPS, 
also known as Neural Network Control Systems (NNCS) [80]. NNV is a software 
verification tool with the ability to compute exact and over-approximate reach- 
able sets of feedforward neural networks (FFNN) [75,77,80], Convolutional Neu- 
ral Networks (CNN) [78], and NNCS [73,80]. In NNV 2.0, we add verification 
support of 3 main DL models: 1) RNNs [74], 2) SSNNs (encoder-decoder archi- 
tectures) [79], and 3) neural ODEs [52], as well as several other improvements 
introduced in Sect. 3, including support for The Verification of Neural Networks 
Library (VNNLIB) [29] and reachability methods for MaxUnpool and Leaky 
ReLU layers. Once the reachability computation is completed, NNV is capable of 
verifying a variety of specifications such as safety or robustness, very commonly 
used in learning-enabled CPS and classification domains, respectively [50,55]. 
We demonstrate NNV capabilities through a collection of safety and robustness 
verification properties, which involve the reachable set computation of feedfor- 
ward, convolutional, semantic segmentation and recurrent neural networks, as 
well as neural ordinary differential equations and neural network control systems. 
Throughout these experiments, we showcase the range of the existing methods, 
executing up to 6 different star-based reachability methods that we compare 
against MATLAB’s commercially available verification tool [69]. 


¹ Code available at: https://github.com/verivital/nnv/releases/tag/cav2023. 
² Archival version: https://doi.org/10.24433/CO.0803700.v1. 


2 Related Work 


The area of DNN verification has increasingly grown in recent years, leading 
to the development of standard input formats [29] as well as friendly com- 
petitions [50,55] that help compare and evaluate all the recent methods and 
tools proposed in the community [4,6,19,22,31,39-41,48,55,59,64,65,77,83,84, 
86,87]. However, the majority of these methods focus on regression and classifica- 
tion tasks performed by FFNN and CNN. In addition to FFNN and CNN verifi- 
cation, Tran et al. [79] introduced a collection of star-based reachability analyses 
that also verify SSNNs. Fischer et al. [21] proposed a probabilistic method for the 
robustness verification of SSNNs based on randomized smoothing [14]. Since then, 
some of the other recent tools, including Verinet [31], α,β-CROWN [84,87], and 
MN-BaB [20] are also able to verify image segmentation properties as demon- 
strated in [55]. A less explored area is the verification of RNN. These models have 
unique “memory units” that enable them to store information for a period of 
time and learn complex patterns of time-series or sequential data. However, due 
to their memory units, verifying the robustness of RNNs is challenging. Recent 
notable state-of-the-art methodologies for verifying RNNs include unrolling the 
network into an FFNN and then verifying it [2], invariant inference [36,62,90], and 
star-based reachability [74]. Similar to RNNs, neural ODEs are also deep learning 
models with “memory”, which makes them suitable to learn time-series data, but 
are also applicable to other tasks such as continuous normalizing flows (CNF) 
and image classification [11,61]. However, existing work is limited to a stochastic 
reachability approach [27,28], reachability approaches using star and zonotope 
reachability methods for a general class of neural ODEs (GNODE) with contin- 
uous and discrete time layers [52], and GAINS [89], which leverages ODE-solver 
information to discretize the models using a computation graph that represents 
all possible trajectories from a given input to accelerate their bound propaga- 
tion method. However, one of the main challenges is to find a framework that is 
able to verify several of these models successfully. For example, α,β-CROWN was 
the top performer on last year’s NN verification competition [55], able to verify 
FFNN, CNN and SSNNs, but it lacks support for neural ODEs or NNCS. There 
exist other tools that focus more on the verification of NNCS such as Verisig 
[34,35], Juliareach [63], ReachNN [17,33], Sherlock [16], RINO [26], VenMas [1], 
POLAR [32], and CORA [3,42]. However, their support is limited to NNCS 
with a linear, nonlinear ODE or hybrid automata as the plant model, and a 
FFNN as the controller. 

Finally, for a more detailed comparison to state-of-the-art methods for the 
novel features of NNV 2.0, we refer to the comparison and discussion about 
neural ODEs in [52]. For SSNNs [79], there is a discussion on scalability and 
conservativeness of methods presented (approx and relax star) for the different 
layers that may be part of a SSNN [79]. For RNNs, the approach details and 
a state-of-the-art comparison can be found in [74]. We also refer the reader to 
two verification competitions, namely VNN-COMP [6,55] and AINNCS ARCH- 
COMP [38,50], for a comparison on state-of-the-art methods for neural network 
verification and neural network control system verification, respectively. 


3 Overview and Features 


NNV is an object-oriented toolbox developed in MATLAB [53] and built on top 
of several open-source software packages, including CORA [3] for reachability analysis of 
nonlinear ordinary differential equations (ODE) [73] and hybrid automata, MPT 
toolbox [45] for polytope-based operations [76], YALMIP [49] for some optimiza- 
tion problems in addition to MATLAB’s Optimization Toolbox [53] and GLPK 
[56], and MatConvNet [82] for some convolution and pooling operations. NNV 
also makes use of MATLAB’s deep learning toolbox to load the Open Neu- 
ral Network Exchange (ONNX) format [57,68], and the Hybrid Systems Model 
Transformation and Translation tool (HyST) [5] for NNCS plant configuration. 

NNV consists of two main modules: a computation engine and an analyzer, 
as illustrated in Fig. 1. The computation engine module consists of four com- 
ponents: 1) NN constructor, 2) NNCS constructor, 3) reachability solvers, and 
4) evaluator. The NN constructor takes as an input a neural network, either 
as a DAGNetwork, dlnetwork, SeriesNetwork (MATLAB built-in formats) [69], 
or as an ONNX file [57], and generates a NN object suitable for verification. 
The NNCS constructor takes as inputs the NN object and an ODE or Hybrid 
Automata (HA) file describing the dynamics of a system, and then creates an 
NNCS object. Depending on the task to solve, the NN or the NNCS object 
is passed to the reachability solver to compute the reachable set of the system 
from a given set of initial conditions. Then, the computed set is sent to the ana- 
lyzer module to verify/falsify a given property, and/or visualize the reachable 
sets. Given a specification, the verifier can formally reason whether the spec- 
ification is met by computing the intersection of the defined property and the 
reachable sets. If an exact (sound and complete) method is used (e.g., exact- 
star), the analyzer can determine if the property is satisfied or unsatisfied. If an 
over-approximate (sound and incomplete) method is used, the verifier may also 
return “uncertain” (unknown), in addition to satisfied or unsatisfied. 
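The decision logic described above can be sketched in a few lines. This is an illustrative simplification, not NNV's implementation: axis-aligned boxes stand in for star sets, the property is given as an unsafe region, and the names `intersects` and `verify` are hypothetical. The sketch also omits the cases in which an over-approximate analysis can still conclude "unsatisfied".

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]  # one (lower, upper) interval per dimension

def intersects(a: Box, b: Box) -> bool:
    """Two boxes intersect iff their intervals overlap in every dimension."""
    return all(lo1 <= hi2 and lo2 <= hi1
               for (lo1, hi1), (lo2, hi2) in zip(a, b))

def verify(reach_sets: List[Box], unsafe: Box, exact: bool) -> str:
    """Classify a property from the reachable sets and an unsafe region."""
    if not any(intersects(r, unsafe) for r in reach_sets):
        return "satisfied"  # sound for exact and over-approximate sets
    # An intersection is conclusive only if the reachable set is exact.
    return "unsatisfied" if exact else "unknown"

reach = [[(0.0, 1.0), (2.0, 3.0)]]
print(verify(reach, [(5.0, 6.0), (0.0, 10.0)], exact=False))  # satisfied
print(verify(reach, [(0.5, 2.0), (2.5, 4.0)], exact=True))    # unsatisfied
print(verify(reach, [(0.5, 2.0), (2.5, 4.0)], exact=False))   # unknown
```

The asymmetry in the last two calls is the point: an empty intersection proves the property for either kind of set, while a non-empty intersection with an over-approximation may be caused by the approximation error alone.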


[Figure 1: block diagram of NNV. Inputs: network configuration, plant configuration, and initial condition. Computation engine: NN constructor, NNCS constructor, reachability solvers, and evaluator. Analyzer: visualizer (plot of reachable sets/traces), verifier (safe/unsafe/robust/uncertain), and falsifier (set of counter inputs or unsafe traces).] 


Fig. 1. An overview of NNV and its major modules and components. 


3.1 NNV 2.0 vs NNV 


Since the introduction of NNV [80], we have added to NNV support for the 
verification of a larger subset of deep learning models. We have added reacha- 
bility methods to verify SSNNs [79], a collection of relax-star reachability 
methods [79], and reachability techniques for Neural ODEs [52] and RNNs [74]. In 
addition, there have been changes that include the creation of a common NN 
class that encapsulates previously supported neural network classes (FFNN and 
CNN) as well as Neural ODEs, SSNNs, and RNNs, which significantly reduces 
the software complexity and simplifies user experience. We have also added direct 
support for ONNX [57], as well as a parser for VNN-LIB [29], which describes 
properties to verify for any class of neural networks. We have also added flexibility 
to use one of the many solvers supported by YALMIP [49], GLPK [56] or lin- 
prog [70]. Table 1 shows a summary of the major features of NNV, highlighting 
the novel features. 


Table 1. Overview of major features available in NNV. Links refer to relevant 
files/classes in the NNV codebase. BN refers to batch normalization layers, FC to 
fully-connected layers, AvgPool to average pooling layers, Conv to convolutional lay- 
ers, and MaxPool to max pooling layers. 


Feature                     | Supported (NNV 2.0 additions in blue) 
Neural Network Type         | FFNN, CNN, NeuralODE, SSNN, RNN 
Layers                      | MaxPool, Conv, BN, AvgPool, FC, MaxUnpool, TC, DC, NODE 
Activation functions        | ReLU, Satlin, Sigmoid, Tanh, Leaky ReLU, Satlins 
Plant dynamics (NNCS)       | Linear ODE, Nonlinear ODE, HA, Continuous & Discrete Time 
Set Representation          | Polyhedron, Zonotope, Star, ImageStar 
Star Reach methods          | exact, approx, abs-dom, relax-range, relax-area, relax-random, relax-bound 
Reachable set visualization | Yes, exact and over-approximation 
Verification                | Safety, Robustness, VNNLIB 
Miscellaneous               | Parallel computing, counterexample generation, ONNX* 

*ONNX was partially supported for feedforward neural networks through 
NNVMT. Support has been extended to other NN types without the need 
for external libraries. 


Semantic Segmentation [79]. Semantic segmentation consists of classifying 
image pixels into one or more classes which are semantically interpretable, like 
the different objects in an image. This task is common in areas like perception 
for autonomous vehicles, and medical imaging [71], which is typically accom- 
plished by neural networks, referred to as semantic segmentation neural net- 
works (SSNNs). These are characterized by two major portions, the encoder, or 
sequence of down-sampling layers to extract important features in the input, and 
the decoder, or sequence of up-sampling layers, to scale back the data informa- 
tion and classify each pixel into its corresponding class. Thus, the verification of 
these models is rather challenging, due to the complexity of the layers, and the 
output space dimensionality. We implement in NNV the collection of reachabil- 
ity methods introduced by Tran et al. [79] that are able to verify the robustness 
of SSNNs. This means that we can formally guarantee the robustness value for 
each pixel, and determine the percentage of pixels that are correctly classified 
despite the adversarial attack. This was demonstrated using several architectures 
on two datasets: MNIST and M2NIST [46]. To achieve this, additional support 
for transposed and dilated convolutional layers was added [79]. 


Neural Ordinary Differential Equations [52]. Continuous deep learning 
models, referred to as Neural ODEs, have received growing attention over 
the last few years [11]. One of the main reasons for their popularity is 
their memory efficiency and their ability to learn from irregularly sampled data 
[61]. Similarly to SSNNs, despite their recent popularity, there is very limited 
work on the formal verification of these models [52]. For this reason, we imple- 
mented in NNV the first deterministic verification approach for a general class 
of neural ODEs (GNODE), which supports GNODEs constructed with 
multiple continuous layers (neural ODEs), linear or nonlinear, as well as any 
discrete-time layer already supported in NNV, such as ReLU, fully-connected 
or convolutional layers [52]. NNV demonstrates its capabilities in a series of 
time-series, control systems and image classification benchmarks, where it sig- 
nificantly outperforms any of the compared tools in the number of benchmarks 
and architectures supported [52]. 


Recurrent Neural Networks [74]. We implement star-based verification 
methods for RNNs introduced in [74]. These are able to verify RNNs without 
unrolling, reducing accumulated over-approximation error by optimized relax- 
ation in the case of approximate reachability. The star set is an efficient technique 
in the computation of RNN reachable sets due to its advantages in computing 
affine mapping, the intersection of half-spaces, and Minkowski summation [74]. 
A new star set representing the reachable set of the current hidden state can 
be directly and efficiently constructed based on the reachable sets of the pre- 
vious hidden state and the current input set. As proposed in verifying FFNNs 
[7,77,78], CNNs [72], and SSNNs [79], tight and efficient over-approximation 
reachability can be applied to the verification of ReLU RNNs. The triangular 
over-approximation of ReLU enables a tight over-approximation of the exact 
reachable set, preventing an exponential increase in the number of star sets dur- 
ing splitting. The state bounds required for the over-approximation can be 
estimated without solving LPs. Furthermore, the relaxed approx- 
imate reachability estimates the triangle over-approximation areas to optimize 
the state ranges by solving LPs. Consequently, the extended exact 
reachability method is 10x faster, and the over-approximation method is 100x 
to 5000x faster than existing state-of-the-art methods [74]. 
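For a single neuron, the triangular over-approximation mentioned above can be written down explicitly. The sketch below shows the standard triangle relaxation of y = ReLU(x) for an input range [l, u] with l < 0 < u; it works on scalars for illustration, whereas NNV applies the relaxation to star sets, and the function names are hypothetical.

```python
def relu(x: float) -> float:
    return max(0.0, x)

def triangle_bounds(x: float, l: float, u: float):
    """Lower/upper bound on y = ReLU(x) for x in [l, u] with l < 0 < u.

    The triangle is given by y >= 0, y >= x, and y <= u * (x - l) / (u - l).
    """
    assert l < 0 < u and l <= x <= u
    lower = max(0.0, x)
    upper = u * (x - l) / (u - l)
    return lower, upper

l, u = -1.0, 3.0
for x in (-1.0, -0.5, 0.0, 1.5, 3.0):
    lo, hi = triangle_bounds(x, l, u)
    assert lo <= relu(x) <= hi  # the relaxation contains the exact output
print(triangle_bounds(0.0, l, u))  # (0.0, 0.75)
```

Because the three constraints are linear, the relaxed output stays a single convex set, which is what avoids splitting into two sets at every neuron whose range crosses zero.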


Zonotope Pre-filtering Star Set Reachability [78]. The star-based reacha- 
bility methods are improved by using the zonotope pre-filtering approach [7,78]. 
This improvement consists of equipping the star set with an outer zonotope 
during the reachability analysis of a ReLU layer, to quickly estimate the lower and 
upper bounds of the star set at each specific neuron and establish whether splitting may 
occur at this neuron without the need to solve any LP problems. The reduction 
of LP optimizations to solve is critical for the scalability of star-set reachability 
methods [77]. For the exact analysis, we are able to avoid the use of the zonotope 
pre-filtering, since we can efficiently construct the new output set with one star, 
if the zero point is not within the set range, or the union of 2 stars, if the zero 
point is contained [78]. In the over-approximation star, the range information is 
required to construct the output set at a specific neuron if and only if the range 
contains the zero point. 
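The bound estimation behind the pre-filtering step can be illustrated as follows. This is a minimal sketch under simplifying assumptions: a zonotope is given by a center and a list of generators with coefficients in [-1, 1], and the helper names are hypothetical rather than taken from the NNV codebase.

```python
from typing import List, Tuple

def zonotope_bounds(center: List[float],
                    generators: List[List[float]]) -> List[Tuple[float, float]]:
    """Bounds of {c + sum_k a_k * g_k : a_k in [-1, 1]}, one pair per neuron.

    generators[k][i] is the i-th coordinate of the k-th generator.
    """
    bounds = []
    for i, c in enumerate(center):
        radius = sum(abs(g[i]) for g in generators)
        bounds.append((c - radius, c + radius))
    return bounds

def neurons_that_may_split(bounds: List[Tuple[float, float]]) -> List[int]:
    """Indices where the estimated range contains zero in its interior."""
    return [i for i, (lo, hi) in enumerate(bounds) if lo < 0.0 < hi]

center = [2.0, -3.0, 0.5]
generators = [[0.5, 1.0, 1.0], [0.5, 0.5, -0.25]]
b = zonotope_bounds(center, generators)
print(b)                          # [(1.0, 3.0), (-4.5, -1.5), (-0.75, 1.75)]
print(neurons_that_may_split(b))  # [2]
```

Since the zonotope encloses the star set, a neuron whose estimated range excludes zero cannot split, so no LP is needed for it; only the remaining neurons (here, index 2) are candidates for an exact bound computation.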


Relax-Star Reachability [79]. To tackle some of the scalability problems that 
may arise when computing the reachable set of complex neural networks such as 
SSNNs, a collection of four relaxed reachability methods were introduced [79]. 
The main goal of these methods is to reduce the number of Linear Programming 
(LP) problems to solve by quickly estimating the bounds or the reachable set, 
and only solving a fraction of the LP problems, while over-approximating the 
others. The LPs to solve are determined by the heuristics chosen, which can be 
random, area-based, bound-based, or range-based. The fraction of LPs to relax is 
chosen by the user, from 0% to 100%. The closer to 100%, 
the more LPs are skipped and over-approximated; the reachable 
set then tends to be a larger over-approximation of the output, but the 
computation time is significantly reduced [79]. 
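The selection step can be sketched as a ranking problem. The code below is a hypothetical illustration, not the implementation from [79]: the scoring functions are guesses at what each heuristic name suggests, and it is an assumption that the neurons with the largest scores are the ones refined by an LP.

```python
import random
from typing import List, Tuple

def select_lps(bounds: List[Tuple[float, float]], relax: float,
               heuristic: str = "area", seed: int = 0) -> List[int]:
    """Return the indices of neurons whose bounds are refined by an LP.

    bounds: estimated (lower, upper) per neuron.
    relax:  fraction in [0, 1] of the candidate LPs that are skipped.
    """
    # Only neurons whose estimated range crosses zero need a decision.
    candidates = [i for i, (lo, hi) in enumerate(bounds) if lo < 0.0 < hi]
    if heuristic == "range":
        score = lambda i: bounds[i][1] - bounds[i][0]
    elif heuristic == "area":
        # Area of the ReLU triangle with vertices (l, 0), (0, 0), (u, u).
        score = lambda i: 0.5 * abs(bounds[i][0]) * bounds[i][1]
    elif heuristic == "bound":
        score = lambda i: max(abs(bounds[i][0]), bounds[i][1])
    elif heuristic == "random":
        rng = random.Random(seed)
        keys = {i: rng.random() for i in candidates}
        score = lambda i: keys[i]
    else:
        raise ValueError(f"unknown heuristic: {heuristic}")
    n_solve = round((1.0 - relax) * len(candidates))
    ranked = sorted(candidates, key=score, reverse=True)
    return sorted(ranked[:n_solve])

est = [(-1.0, 1.0), (-4.0, 2.0), (0.5, 2.0), (-0.1, 0.2), (-3.0, 3.0)]
print(select_lps(est, relax=0.5, heuristic="area"))  # [1, 4]
print(select_lps(est, relax=1.0))                    # []
```

With a relaxation of 100% no LP is solved at all, which matches the trade-off described above: the fastest setting is also the most conservative one.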


Other Updates. In addition to the previous features described, there is a set 
of changes and additions included in the latest NNV version: 

- Activation Functions. The star set method is extended to other classes of 
piecewise activation functions such as saturating linear layer (satlin), saturating 
linear symmetric layer (satlins), and leaky ReLU. The reachability analysis of 
each of these functions can be performed similarly to ReLU layers using the 
zonotope pre-filtering method to find where splits happen. 

- LP solver. We generalize the use of LP solvers across all methods and 
optimizations. The user can select the solver to use, choosing 
between GLPK [56], linprog [70] (MATLAB’s Optimization Toolbox) or any of 
the solvers supported by YALMIP [49]. We select linprog as the default solver, 
while keeping GLPK as a backup. However, if a different solver is selected that 
is supported by YALMIP, our implementation of the LP solver abstraction also 
supports this selection for any reachability method. 

- Standard Input Formats. In the past few years, the verification community 
has been working to standardize formats across all tools to facilitate comparison 
among them. We have improved NNV by replacing the NNVMT tool [81] with a 
module to load ONNX [57] networks directly from MATLAB, as well as adding 
support for VNNLIB [29] files to define NN properties. 


4 Evaluation 


The evaluation is divided into 4 sections: 1) Comparison of FFNN and CNN 
to MATLAB’s commercial toolbox [53,69], 2) Reachability analysis of Neural 
ODEs [52], 3) Robustness Verification of RNNs [74], and 4) Robustness Verifica- 
tion of SSNNs [79]. The results presented were all performed on a desktop with 
the following configuration: AMD Ryzen 9 5900X @3.7GHz 12-Core Processor, 
64 GB Memory, and 64-bit Microsoft Windows 10 Pro. 


4.1 Comparison to MATLAB’s Deep Learning Verification Toolbox 


In this comparison, we make use of a subset of the benchmarks and properties 
evaluated in last year’s Verification of Neural Networks Competition (VNN-COMP) [55] 
to demonstrate the capabilities of NNV with respect to the latest 
commercial product from MATLAB for the verification of neural networks [69]. 

We compared them on a subset of benchmarks from VNN-COMP’22 [55]: 
ACAS Xu, Tllverify, Oval21 (CIFAR10 [43]), and RL benchmarks. For ACAS Xu, 
we verify 90 out of 145 properties, where we compare 
Table 2. Verification of ACAS Xu properties 3 and 4. 


                       | matlab | approx | relax 25% | relax 50% | relax 75% | relax 100% | exact (8) 
prop 3 (45) | SAT      | 3      | 3      | 3         | 2         | 0         | 0          | 3 
            | UNSAT    | 10     | 29     | 8         | 2         | 1         | 0          | 42 
            | time (s) | 0.1383 | 0.6368 | 0.6192    | 0.5714    | 0.3843    | 0.0276     | 521.9 
prop 4 (45) | SAT      | 1      | 3      | 3         | 2         | 0         | 0          | 3 
            | UNSAT    | 2      | 32     | 6         | 1         | 1         | 0          | 42 
            | time (s) | 0.1387 | 0.6492 | 0.6420    | 0.5682    | 0.3568    | 0.0261     | 89.85 


Table 3. Verification results of the RL, tllverify and oval21 benchmarks. We selected 
50 random specifications from the RL benchmarks, 10 from tllverify and all 30 from 
oval21. - means that the benchmark is not supported. 


       | RL (50)                | Tllverify (10)         | Oval21 (30) 
       | SAT | UNSAT | time (s) | SAT | UNSAT | time (s) | SAT | UNSAT | time (s) 
matlab | 20  | 11    | 0.0504   | 0   | 0     | 0.1947   | -   | -     | - 
NNV    | 32  | 14    | 0.0822   | 0   | 0     | 13.57    | 0   | 11    | 136.5 


MATLAB’s methods, approx-star, exact (parallel, 8 cores) and 4 relax-star meth- 
ods. From the other 3 benchmarks, we select a total of 90 properties to verify, 
from which we limit the comparison to the approx-star and MATLAB’s method. 
In this section, we demonstrate NNV is able to verify fully-connected layers, 
ReLU layers, flatten layers, and convolutional layers. The results of this compar- 
ison are described in Table 2. We can observe that MATLAB’s computation time 
is faster than NNV star methods, except for the relax star with 100% relaxation. 
However, NNV’s exact and approx methods significantly outperform MATLAB’s 
framework by verifying 100% and 74% of the properties respectively, compared to 
18% from MATLAB’s. The remainder of the comparison is described in Table 3, 
which shows a similar trend: MATLAB’s computation is faster, while NNV is 
able to verify a larger fraction of the properties. 
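The percentages quoted for ACAS Xu follow directly from the SAT and UNSAT counts in Table 2. The short calculation below reproduces them; the dictionary layout is only an illustrative way of transcribing the table.

```python
# (SAT, UNSAT) counts per method, taken from Table 2 for properties 3 and 4.
table2 = {
    "matlab": [(3, 10), (1, 2)],
    "approx": [(3, 29), (3, 32)],
    "exact":  [(3, 42), (3, 42)],
}
total = 45 + 45  # 45 instances per property

def verified_percentage(method: str) -> float:
    verified = sum(sat + unsat for sat, unsat in table2[method])
    return 100.0 * verified / total

for method in table2:
    print(method, round(verified_percentage(method)))
```

This gives 18% for MATLAB, 74% for approx-star, and 100% for exact-star, in line with the text.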


4.2 Neural Ordinary Differential Equations 


We exhibit the reachability analysis of GNODEs with three tasks: dynamical 
system modeling of a Fixed Point Attractor (FPA) [52,54], image classification 
of MNIST [46], and an adaptive cruise control (ACC) system [73]. 


Dynamical Systems. For the FPA, we compute the reachable set for a time 
horizon of 10s, given a perturbation of ± 0.01 on all 5 input dimensions. The 
results of this example are illustrated in Fig. 2c, with a computation time of 
3.01s. The FPA model consists of one nonlinear neural ODE; no discrete-time 
layers are part of this model [52]. 


Classification. For the MNIST benchmark, we evaluate the robustness of two 
GNODEs with convolutional, fully-connected, ReLU and neural ODE layers, 
corresponding to the CNODE_S and CNODE_M models introduced in [52]. We verify 
the robustness of 5 random images under an L∞ attack with a perturbation 
value of ± 0.5 on all the pixels. We are able to prove the robustness of both 
models on 100% of images, with an average computation time of 16.3s for the 
CNODE_S, and 119.9s for the CNODE_M. 


Fig. 2. Verification of RNN and neural ODE results. Figure 2a shows the verification 
time of the 3 RNNs evaluated. Figure 2b depicts the safety verification of the ACC, 
and Fig. 2c shows the reachability results of the FPA benchmark. 


Control Systems. We verify an NNCS of an adaptive cruise control (ACC) 
system, where the controller is an FFNN with 5 ReLU layers with 20 neurons 
each, and one output linear layer, and the plant is a nonlinear neural ODE [52]. 
The verification results are illustrated in Fig. 2b, showing the current distance 
between the ego and lead cars and the safety distance allowed. We can observe 
that there is no intersection between the two, guaranteeing its safety. 


4.3 Recurrent Neural Networks 


For the RNN evaluation, we consider three RNNs trained on the speaker 
recognition VCTK dataset [88]. Each network has an input layer of 40 neurons, 
two hidden layers with 2, 4, or 8 memory units, followed by 5 ReLU layers with 
32 neurons, and an output layer of 20 neurons. For each of the networks, we 
use the same 5 input points (40-dimensional time-independent vectors) for com- 
parison. The robustness verification consists of proving that the output label 
after T ∈ {5, 10, 15, 20} steps in the sequence is still the same, given an adver- 
sarial attack perturbation of ε = ± 0.01. We compute the reachable sets of all 
reachability instances using the approx-star method, which was able to prove 
the robustness of 19 out of 20 on the N_{2,0} and N_{4,4} networks, and 18 for the N_{8,0} 
network. We show the average reachability time per T value in Fig. 2a. 
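Once the output reachable set is available, the robustness check itself is simple. The sketch below is a simplified illustration that works on per-class output intervals (a coarser check than reasoning on the star set directly); the function name is hypothetical.

```python
from typing import List, Tuple

def is_robust(output_bounds: List[Tuple[float, float]], label: int) -> bool:
    """Sufficient check: the label's lower bound beats every other upper bound."""
    target_lower = output_bounds[label][0]
    return all(target_lower > hi
               for i, (_, hi) in enumerate(output_bounds) if i != label)

print(is_robust([(2.0, 3.0), (0.0, 1.5), (-1.0, 1.9)], label=0))  # True
print(is_robust([(2.0, 3.0), (0.0, 2.5), (-1.0, 1.9)], label=0))  # False
```

A result of False only means that robustness could not be proven from these bounds; with an over-approximate method such as approx-star it does not by itself establish that an adversarial input exists.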


4.4 Semantic Segmentation 


We demonstrate the robustness verification of two SSNNs, one with dilated con- 
volutional layers and the other one with transposed convolutional layers, in addi- 
tion to average pooling, convolutional and ReLU layers, which correspond to N_4 
and N_5 introduced in Table 1 by Tran et al. [79]. We evaluate them on one 
random image of M2NIST [18] by attacking it using a UBAA bright- 
ening attack [79]. One of the main differences of this evaluation with respect 
to the robustness analysis of other classification tasks is the evaluation metrics used. 
For these networks, we evaluate the average robustness values (percentage of 
pixels correctly classified), sensitivity (number of not robust pixels over number 
of attacked pixels), and IoU (intersection over union) of the SSNNs. The compu- 
tation time for the dilated example, shown in Fig. 3, is 54.52 s, with a robustness 
value of 97.2%, a sensitivity of 3.04, and an IoU of 57.8%. For the equivalent exam- 
ple with the transposed network, the robustness value is 98.14%, the sensitivity is 
2, the IoU is 72.8%, and the computation time is 7.15 s. 
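The three metrics can be computed from per-pixel verification results as sketched below. The robustness value and sensitivity follow the definitions given in the text; the IoU is shown for two sets of pixel indices, which is an assumption about how it is applied here. The data are toy values and the function names are hypothetical.

```python
from typing import List, Optional, Set

def robustness_value(verified: List[Optional[int]], truth: List[int]) -> float:
    """Percentage of pixels whose verified label matches the ground truth.

    verified[i] is None when pixel i could not be proven robust.
    """
    robust = sum(v == t for v, t in zip(verified, truth))
    return 100.0 * robust / len(truth)

def sensitivity(verified: List[Optional[int]], truth: List[int],
                n_attacked: int) -> float:
    """Number of non-robust pixels divided by the number of attacked pixels."""
    not_robust = sum(v != t for v, t in zip(verified, truth))
    return not_robust / n_attacked

def iou(predicted: Set[int], target: Set[int]) -> float:
    """Intersection over union of two sets of pixel indices."""
    union = predicted | target
    return len(predicted & target) / len(union) if union else 1.0

truth = [0, 0, 1, 1, 1, 0, 0, 0]
verified = [0, 0, 1, None, 1, 0, 1, 0]
print(robustness_value(verified, truth))          # 75.0
print(sensitivity(verified, truth, n_attacked=2)) # 1.0
print(iou({2, 4, 6}, {2, 3, 4}))                  # 0.5
```

Note that the sensitivity can exceed 1, as in the results above, because attacking one pixel may change the classification of several neighboring pixels.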


[Figure 3 panels: a) Target Image, b) Transposed SSNN, c) Dilated SSNN] 


Fig. 3. Robustness verification of the dilated and transposed SSNN under a UBAA 
brightening attack to 150 random pixels in the input image. 


5 Conclusions 


We presented version 2.0 of NNV, the updated version of the Neural Network 
Verification (NNV) tool [80], a software tool for the verification of deep learning 
models and learning-enabled CPS. To the best of our knowledge, NNV is the 
most comprehensive verification tool in terms of the number of tasks and neural 
network architectures supported, including the verification of feedforward, con- 
volutional, semantic segmentation, and recurrent neural networks, neural ODEs 
and NNCS. With the recent additions to NNV, we have demonstrated that NNV 
can be a one-stop verification tool for users with a diverse problem set, where ver- 
ification of multiple neural network types is needed. In addition, NNV supports 
zonotope, polyhedron based methods, and up to 6 different star-based reachabil- 
ity methods to handle verification tradeoffs for the verification problem of neural 
networks, ranging from the exact-star, which is sound and complete, but com- 
putationally expensive, to the relax-star methods, which are significantly faster 
but more conservative. We have also shown that NNV outperforms a commer- 
cially available product from MATLAB, which computes the reachable sets of 
feedforward neural networks using the zonotope reachability method presented 
in [66]. In the future, we plan to ensure support for other deep learning models 
such as ResNets [30] and UNets [60]. 
Abstract. To alleviate the practical constraints for deploying deep neu- 
ral networks (DNNs) on edge devices, quantization is widely regarded as 
one promising technique. It reduces the resource requirements for com- 
putational power and storage space by quantizing the weights and/or 
activation tensors of a DNN into lower bit-width fixed-point numbers, 
resulting in quantized neural networks (QNNs). While it has been empir- 
ically shown to introduce minor accuracy loss, critical verified properties 
of a DNN might become invalid once quantized. Existing verification 
methods focus on either individual neural networks (DNNs or QNNs) 
or quantization error bound for partial quantization. In this work, we 
propose a quantization error bound verification method, named QEB- 
Verif, where both weights and activation tensors are quantized. QEBVerif 
consists of two parts, i.e., a differential reachability analysis (DRA) and 
a mixed-integer linear programming (MILP) based verification method. 
DRA performs difference analysis between the DNN and its quantized 
counterpart layer-by-layer to compute a tight quantization error inter- 
val efficiently. If DRA fails to prove the error bound, then we encode 
the verification problem into an equivalent MILP problem which can 
be solved by off-the-shelf solvers. Thus, QEBVerif is sound, complete, 
and reasonably efficient. We implement QEBVerif and conduct extensive 
experiments, showing its effectiveness and efficiency. 


1 Introduction 


In the past few years, the development of deep neural networks (DNNs) has 
grown at an impressive pace owing to their outstanding performance in solving 
various complicated tasks [23,28]. However, modern DNNs are often large in 
size and contain a great number of 32-bit floating-point parameters to achieve 
competitive performance. Thus, they often result in high computational costs 
and excessive storage requirements, hindering their deployment on resource- 
constrained embedded devices, e.g., edge devices. A promising solution is to 
quantize the weights and/or activation tensors as fixed-point numbers of lower 
© The Author(s) 2023 
bit-width [17,21,25,35]. For example, TensorFlow Lite [18] supports quantization
of weights and/or activation tensors to reduce the model size and latency,
and Tesla FSD-chip [61] stores all the data and weights of a network in the form 
of 8-bit integers. 

In spite of the empirically impressive results which show there is only minor 
accuracy loss, quantization does not necessarily preserve properties such as 
robustness [16]. Even worse, input perturbation can be amplified by quanti- 
zation [11,36], worsening the robustness of quantized neural networks (QNNs) 
compared to their DNN counterparts. Indeed, existing neural network quan- 
tization methods focus on minimizing its impact on model accuracy (e.g., by 
formulating it as an optimization problem that aims to maximize the accu- 
racy [27,43]). However, they cannot guarantee that the final quantization error 
is always lower than a given error bound, especially when some specific safety- 
critical input regions are concerned. This is concerning as such errors may lead to 
catastrophes when the quantized networks are deployed in safety-critical appli- 
cations [14,26]. Furthermore, analyzing (in particular, quantifying) such errors 
can also help us understand how quantization affects the network behaviors [33],
and provide insights on, for instance, how to choose appropriate quantization 
bit sizes without introducing too much error. Therefore, a method that soundly 
quantifies the errors between DNNs and their quantized counterparts is highly 
desirable. 

There is a large and growing body of work on developing verification 
methods for DNNs [2,12,13,15,19,24,29,30,32,37,38,51,54,55,58-60,62] and 
QNNs [1,3,16,22,46,66,68], aiming to establish a formal guarantee on the net- 
work behaviors. However, all the above-mentioned methods focus exclusively on 
verifying individual neural networks. Recently, Paulsen et al. [48,49] proposed 
differential verification methods, aimed to establish formal guarantees on the 
difference between two DNNs. Specifically, given two DNNs $\mathcal{N}_1$ and $\mathcal{N}_2$ with the
same network topology and inputs, they try to prove that $|\mathcal{N}_1(\mathbf{x}) - \mathcal{N}_2(\mathbf{x})| < \epsilon$
for all possible inputs $\mathbf{x} \in \mathcal{X}$, where $\mathcal{X}$ is the input region of interest. They
presented fast and sound difference propagation techniques followed by a refine- 
ment of the input region until the property can be successfully verified, i.e., the 
property is either proved or falsified by providing a counterexample. This idea 
has been extended to handle recurrent neural networks (RNNs) [41] though the 
refinement is not considered therein. Although their methods [41,48,49] can be 
used to analyze the error bound introduced by quantizing weights (called par- 
tially QNNs), they are not complete and cannot handle the cases where both 
the weights and activation tensors of a DNN are quantized to lower bit-width 
fixed-point numbers (called fully QNNs). We remark that fully QNNs can significantly
reduce energy consumption, since floating-point operations consume much
more energy than integer-only operations [61].


Main Contributions. We propose a sound and complete Quantization Error 
Bound Verification method (QEBVerif) to efficiently and effectively verify if the 
quantization error of a fully QNN w.r.t. an input region and its original DNN 
is always lower than an error bound (a.k.a. robust error bound [33]). QEBVerif 
first conducts a novel reachability analysis to quantify the quantization errors,
which is referred to as differential reachability analysis (DRA). Such an analysis 
yields two results: (1) Proved, meaning that the quantization error is proved to 
be always less than the given error bound; or (2) Unknown, meaning that it fails 
to prove the error bound, possibly due to a conservative approximation of the 
quantization error. If the outcome is Unknown, we further encode this quanti- 
zation error bound verification problem into an equivalent mixed-integer linear 
programming (MILP) problem, which can be solved by off-the-shelf solvers. 

There are two main technical challenges that must be addressed for DRA. 
First, the activation tensors in a fully QNN are discrete values and contribute 
additional rounding errors to the final quantization errors, which are hard to 
propagate symbolically and make it difficult to establish relatively accurate difference
intervals. Second, many more activation patterns (i.e., 3 × 6 = 18) have
to be considered in a forward propagation, while 9 activation patterns are sufficient
in [48,49], where an activation pattern indicates the status of the output range
of a neuron. A neuron in a DNN under an input region has 3 patterns: always-active
(i.e., output > 0), always-inactive (i.e., output < 0), or both possible. A
neuron in a QNN has 6 patterns due to the clamp function (cf. Definition 2).
We remark that handling these different combinations efficiently and soundly is 
highly nontrivial. To tackle the above challenges, we propose sound transforma- 
tions for the affine and activation functions to propagate quantization errors of 
two networks layer-by-layer. Moreover, for the affine transformation, we provide 
two alternative solutions: interval-based and symbolic-based. The former directly 
computes sound difference intervals via interval analysis [42], while the latter 
leverages abstract interpretation [10] to compute sound and symbolic difference 
intervals, using the polyhedra abstract domain. In comparison, the symbolic- 
based one is usually more accurate but less efficient than the interval-based one. 
Note that though existing tools can obtain quantization error intervals by inde- 
pendently computing the output intervals of two networks followed by interval 
subtractions, such an approach is often too conservative. 

To resolve those problems that cannot be proved via our DRA, we resort to 
the sound and complete MILP-based verification method. Inspired by the MILP 
encoding of DNN and QNN verification [39,40,68], we propose a novel MILP 
encoding for verifying quantization error bounds. QEBVerif represents both the 
computations of the QNN and the DNN in mixed-integer linear constraints which 
are further simplified using their own output intervals. Moreover, we also encode 
the output difference intervals of hidden neurons from our DRA as mixed-integer 
linear constraints to boost the verification. 

We implement our method as an end-to-end tool and use Gurobi [20] as 
our back-end MILP solver. We extensively evaluate it on a large set of verifica- 
tion tasks using neural networks for ACAS Xu [26] and MNIST [31], where the 
number of neurons varies from 310 to 4890, the number of bits for quantizing 
weights and activation tensors ranges from 4 to 10 bits, and the number of bits 
for quantizing inputs is fixed to 8 bits. For DRA, we compare QEBVerif with a 
naive method that first independently computes the output intervals of DNNs 
and QNNs using the existing state-of-the-art (symbolic) interval analysis [22,55],
and then conducts an interval subtraction. The experimental results show that
both our interval- and symbolic-based approaches are much more accurate and
can successfully verify many more tasks without the MILP-based verification. We
also find that the quantization error interval returned by DRA is getting tighter 
with the increase of the quantization bit size. The experimental results also con- 
firm the effectiveness of our MILP-based verification method, which can help 
verify many tasks that cannot be solved by DRA solely. Finally, our results also 
allow us to study the potential correlation of quantization errors and robustness 
for QNNs using QEBVerif. 


We summarize our contributions as follows: 


— We introduce the first sound, complete, and reasonably efficient quantiza- 
tion error bound verification method QEBVerif for fully QNNs by cleverly 
combining novel DRA and MILP-based verification methods; 

— We propose a novel DRA to compute sound and tight quantization error 
intervals accompanied by an abstract domain tailored to QNNs, which can 
significantly and soundly tighten the quantization error intervals; 

— We implement QEBVerif as an end-to-end open-source tool [64] and conduct 
an extensive evaluation on various verification tasks, demonstrating its effec- 
tiveness and efficiency. 


The source code of our tool and benchmarks are available at
https://github.com/S3L-official/QEBVerif. Missing proofs, more examples, and experimental
results can be found in [65].


2 Preliminaries 


We denote by $\mathbb{R}$, $\mathbb{Z}$, $\mathbb{N}$ and $\mathbb{B}$ the sets of real-valued numbers, integers, natural
numbers, and Boolean values, respectively. Let $[n]$ denote the integer set
$\{1, \ldots, n\}$ for given $n \in \mathbb{N}$. We use BOLD UPPERCASE (e.g., $\mathbf{W}$) and bold
lowercase (e.g., $\mathbf{x}$) to denote matrices and vectors, respectively. We denote by
$\mathbf{W}_{i,j}$ the $j$-th entry in the $i$-th row of the matrix $\mathbf{W}$, and by $\mathbf{x}_i$ the $i$-th entry of
the vector $\mathbf{x}$. Given a matrix $\mathbf{W}$ and a vector $\mathbf{x}$, we use $\widehat{\mathbf{W}}$ and $\hat{\mathbf{x}}$ (resp. $\widetilde{\mathbf{W}}$ and
$\tilde{\mathbf{x}}$) to denote their quantized/integer (resp. fixed-point) counterparts.


2.1 Neural Networks 


A deep neural network (DNN) consists of a sequence of layers, where the first 
layer is the input layer, the last layer is the output layer and the others are called 
hidden layers. Each layer contains one or more neurons. A DNN is feed-forward 
if all the neurons in each non-input layer only receive inputs from the neurons 
in the preceding layer. 


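Definition 1 (Feed-forward Deep Neural Network). A feed-forward DNN
$\mathcal{N} : \mathbb{R}^n \to \mathbb{R}^s$ with $d$ layers can be seen as a composition of $d$ functions such
that $\mathcal{N} = l_d \circ l_{d-1} \circ \cdots \circ l_1$. Then, given an input $\mathbf{x} \in \mathbb{R}^n$, the output of the DNN
$\mathbf{y} = \mathcal{N}(\mathbf{x})$ can be obtained by the following recursive computation:

- Input layer $l_1 : \mathbb{R}^n \to \mathbb{R}^{n_1}$ is the identity function, i.e., $\mathbf{x}^1 = l_1(\mathbf{x}) = \mathbf{x}$;

- Hidden layer $l_i : \mathbb{R}^{n_{i-1}} \to \mathbb{R}^{n_i}$ for $2 \le i \le d-1$ is the function such that
$\mathbf{x}^i = l_i(\mathbf{x}^{i-1}) = \phi(\mathbf{W}^i \mathbf{x}^{i-1} + \mathbf{b}^i)$;

- Output layer $l_d : \mathbb{R}^{n_{d-1}} \to \mathbb{R}^s$ is the function such that
$\mathbf{y} = \mathbf{x}^d = l_d(\mathbf{x}^{d-1}) = \mathbf{W}^d \mathbf{x}^{d-1} + \mathbf{b}^d$,

where $n_1 = n$, $\mathbf{W}^i$ and $\mathbf{b}^i$ are the weight matrix and bias vector in the $i$-th layer,
and $\phi(\cdot)$ is the activation function which acts element-wise on an input vector.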


In this work, we focus on feed-forward DNNs with the most commonly used activation
function: the rectified linear unit (ReLU) function, defined as $\text{ReLU}(x) =
\max(x, 0)$. We also use $n_d$ to denote the output dimension $s$.

A quantized neural network (QNN) is structurally similar to its real-valued 
counterpart, except that all the parameters, inputs of the QNN, and outputs of 
all the hidden layers are quantized into integers according to the given quantiza- 
tion scheme. Then, the computation over real-valued arithmetic in a DNN can 
be replaced by the computation using integer arithmetic, or equally, fixed-point 
arithmetic. In this work, we consider the most common quantization scheme, i.e., 
symmetric uniform quantization [44]. We first give the concept of quantization 
configuration which effectively defines a quantization scheme. 

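A quantization configuration $\mathcal{C}$ is a tuple $\langle \tau, Q, F \rangle$, where $Q$ and $F$ are the
total bit size and the fractional bit size allocated to a value, respectively, and
$\tau \in \{+, \pm\}$ indicates if the quantized value is unsigned or signed. Given a real
number $x \in \mathbb{R}$ and a quantization configuration $\mathcal{C} = \langle \tau, Q, F \rangle$, its quantized
integer counterpart $\hat{x}$ and the fixed-point counterpart $\tilde{x}$ under the symmetric
uniform quantization scheme are:

$$\hat{x} = \text{clamp}(\lfloor 2^{F} \cdot x \rceil, \mathcal{C}^{lb}, \mathcal{C}^{ub}) \quad \text{and} \quad \tilde{x} = \hat{x} / 2^{F}$$

where $\mathcal{C}^{lb} = 0$ and $\mathcal{C}^{ub} = 2^{Q} - 1$ if $\tau = +$, $\mathcal{C}^{lb} = -2^{Q-1}$ and $\mathcal{C}^{ub} = 2^{Q-1} - 1$ otherwise,
and $\lfloor \cdot \rceil$ is the round-to-nearest integer operator. The clamping function
$\text{clamp}(x, a, b)$ with a lower bound $a$ and an upper bound $b$ is defined as:

$$\text{clamp}(x, a, b) = \begin{cases} a, & \text{if } x < a; \\ x, & \text{if } a \le x \le b; \\ b, & \text{if } x > b. \end{cases}$$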
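
The quantization scheme above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `quantize`, `clamp` and `round_half_up` are hypothetical, and the tie-breaking rule (ties round up) is an assumption, since the text only specifies round-to-nearest.

```python
import math


def clamp(x, lo, hi):
    """Clamp x into the closed range [lo, hi]."""
    return max(lo, min(x, hi))


def round_half_up(x):
    # Round-to-nearest; sending ties upwards is an assumption of this sketch.
    return math.floor(x + 0.5)


def quantize(x, signed, Q, F):
    """Return (integer, fixed-point) counterparts of x under <tau, Q, F>.

    signed=True plays the role of tau = +/-, signed=False of tau = +.
    """
    if signed:
        lb, ub = -(2 ** (Q - 1)), 2 ** (Q - 1) - 1
    else:
        lb, ub = 0, 2 ** Q - 1
    x_int = clamp(round_half_up((2 ** F) * x), lb, ub)
    return x_int, x_int / (2 ** F)
```

For instance, with a signed 4-bit configuration and 2 fractional bits, the weight 1.2 maps to the integer 5 and the fixed-point value 1.25, while values outside the representable range saturate at the clamp bounds.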


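Definition 2 (Quantized Neural Network). Given quantization configurations
for the weights, biases, output of the input layer and each hidden layer as
$\mathcal{C}_w = \langle \tau_w, Q_w, F_w \rangle$, $\mathcal{C}_b = \langle \tau_b, Q_b, F_b \rangle$, $\mathcal{C}_{in} = \langle \tau_{in}, Q_{in}, F_{in} \rangle$, $\mathcal{C}_h = \langle \tau_h, Q_h, F_h \rangle$,
the quantized version (i.e., QNN) of a DNN $\mathcal{N}$ with $d$ layers is a function
$\widehat{\mathcal{N}} : \mathbb{Z}^n \to \mathbb{R}^s$ such that $\widehat{\mathcal{N}} = \hat{l}_d \circ \hat{l}_{d-1} \circ \cdots \circ \hat{l}_1$. Then, given a quantized input
$\hat{\mathbf{x}} \in \mathbb{Z}^n$, the output of the QNN $\hat{\mathbf{y}} = \widehat{\mathcal{N}}(\hat{\mathbf{x}})$ can be obtained by the following
recursive computation:

- Input layer $\hat{l}_1 : \mathbb{Z}^n \to \mathbb{Z}^{n_1}$ is the identity function, i.e., $\hat{\mathbf{x}}^1 = \hat{l}_1(\hat{\mathbf{x}}) = \hat{\mathbf{x}}$;

- Hidden layer $\hat{l}_i : \mathbb{Z}^{n_{i-1}} \to \mathbb{Z}^{n_i}$ for $2 \le i \le d-1$ is the function such that for
each $j \in [n_i]$,

$$\hat{x}^i_j = \text{clamp}(\lfloor 2^{F_i} \widehat{\mathbf{W}}^i_{j,:} \cdot \hat{\mathbf{x}}^{i-1} + 2^{F_h - F_b} \hat{b}^i_j \rceil, 0, \mathcal{C}_h^{ub}),$$

where $F_i$ is $F_h - F_w - F_{in}$ if $i = 2$, and $-F_w$ otherwise;

- Output layer $\hat{l}_d : \mathbb{Z}^{n_{d-1}} \to \mathbb{R}^s$ is the function such that
$\hat{\mathbf{y}} = \hat{\mathbf{x}}^d = \hat{l}_d(\hat{\mathbf{x}}^{d-1}) = 2^{-F_w} \widehat{\mathbf{W}}^d \hat{\mathbf{x}}^{d-1} + 2^{F_h - F_b} \hat{\mathbf{b}}^d$,

where for every $2 \le i \le d$ and $k \in [n_{i-1}]$, $\widehat{W}^i_{j,k} = \text{clamp}(\lfloor 2^{F_w} W^i_{j,k} \rceil, \mathcal{C}_w^{lb}, \mathcal{C}_w^{ub})$ is
the quantized weight and $\hat{b}^i_j = \text{clamp}(\lfloor 2^{F_b} b^i_j \rceil, \mathcal{C}_b^{lb}, \mathcal{C}_b^{ub})$ is the quantized bias.


Fig. 1. A 3-layer DNN $\mathcal{N}_e$ and its quantized version $\widehat{\mathcal{N}}_e$.


We remark that $2^{F_i}$ and $2^{F_h - F_b}$ in Definition 2 are used to align the precision
between the inputs and outputs of hidden layers, and that $F_i$ differs for $i = 2$ and $i > 2$
because quantization bit sizes for the outputs of the input layer and hidden layers
can be different.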


2.2 Quantization Error Bound and Its Verification Problem 


We now give the formal definition of the quantization error bound verification 
problem considered in this work as follows. 


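Definition 3 (Quantization Error Bound). Given a DNN $\mathcal{N} : \mathbb{R}^n \to \mathbb{R}^s$,
the corresponding QNN $\widehat{\mathcal{N}} : \mathbb{Z}^n \to \mathbb{R}^s$, a quantized input $\hat{\mathbf{x}} \in \mathbb{Z}^n$, a radius $r \in \mathbb{N}$
and an error bound $\epsilon \in \mathbb{R}$. The QNN $\widehat{\mathcal{N}}$ has a quantization error bound of $\epsilon$ w.r.t.
the input region $R(\hat{\mathbf{x}}, r) = \{\hat{\mathbf{x}}' \in \mathbb{Z}^n \mid \|\hat{\mathbf{x}}' - \hat{\mathbf{x}}\|_\infty \le r\}$ if for every $\hat{\mathbf{x}}' \in R(\hat{\mathbf{x}}, r)$,
we have $\|2^{-F_h} \widehat{\mathcal{N}}(\hat{\mathbf{x}}') - \mathcal{N}(\mathbf{x}')\|_\infty < \epsilon$, where $\mathbf{x}' = \hat{\mathbf{x}}' / (\mathcal{C}_{in}^{ub} - \mathcal{C}_{in}^{lb})$.


Intuitively, the quantization error bound is the bound of the output difference of
the DNN and its quantized counterpart for all the inputs in the input region. In
this work, we obtain the input for the DNN by dividing $\hat{\mathbf{x}}'$ by $(\mathcal{C}_{in}^{ub} - \mathcal{C}_{in}^{lb})$ to allow
input normalization. Furthermore, $2^{-F_h}$ is used to align the precision between
the outputs of the QNN and the DNN.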


Example 1. Consider the DNN $\mathcal{N}_e$ with 3 layers (one input layer, one hidden
layer, and one output layer) given in Fig. 1, where weights are associated with the
edges and all the biases are 0. The quantization configurations for the weights,
the output of the input layer and hidden layer are $\mathcal{C}_w = \langle \pm, 4, 2 \rangle$, $\mathcal{C}_{in} = \langle +, 4, 4 \rangle$
and $\mathcal{C}_h = \langle +, 4, 2 \rangle$. Its QNN $\widehat{\mathcal{N}}_e$ is shown in Fig. 1.
Given a quantized input $\hat{\mathbf{x}} = (9, 6)$ and a radius $r = 1$, the input region for
QNN $\widehat{\mathcal{N}}_e$ is $R((9,6),1) = \{(x, y) \in \mathbb{Z}^2 \mid 8 \le x \le 10, 5 \le y \le 7\}$. Since $\mathcal{C}_{in}^{ub} = 15$
and $\mathcal{C}_{in}^{lb} = 0$, by Definitions 1, 2, and 3, we have the maximum quantization error
as $\max |2^{-2} \widehat{\mathcal{N}}_e(\hat{\mathbf{x}}') - \mathcal{N}_e(\hat{\mathbf{x}}'/15)| = 0.067$ for $\hat{\mathbf{x}}' \in R((9,6),1)$. Then, $\widehat{\mathcal{N}}_e$ has a
quantization error bound of $\epsilon$ w.r.t. input region $R((9,6),1)$ for any $\epsilon > 0.067$.

We remark that if only weights are quantized and the activation tensors
are floating-point numbers, the maximal quantization error of $\widehat{\mathcal{N}}_e$ for the input
region $R((9,6),1)$ is 0.04422, which implies that existing methods [48,49] cannot
be used to analyze the error bound for a fully QNN.
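
The maximum quantization error stated in Example 1 can be checked by brute force over the nine quantized inputs in the region. The sketch below is illustrative only and rests on several assumptions: the weights are not given in the running text, so they are read off the arithmetic of Example 2 (e.g., "1.25 − 1.2"); the weights are taken to be signed 4-bit values; rounding ties are sent upwards; and all function names are hypothetical.

```python
import math
from itertools import product

# Weights inferred from the arithmetic in Example 2 (assumption: Fig. 1 itself
# is not available in the text).
W2 = [[1.2, -0.2], [-0.7, 0.8]]   # hidden layer of the DNN
W3 = [0.3, 0.7]                   # output layer of the DNN
F_W, F_IN, F_H = 2, 4, 2          # fractional bits for weights, inputs, hidden outputs
C_H_UB, C_IN_UB, C_IN_LB = 15, 15, 0


def rnd(v):
    return math.floor(v + 0.5)    # assumed tie-breaking: round half up


def clamp(v, lo, hi):
    return max(lo, min(v, hi))


def qw(w):
    # Quantized integer weight; signed 4-bit range [-8, 7] is an assumption.
    return clamp(rnd(2 ** F_W * w), -8, 7)


def dnn(x):
    h = [max(sum(w * xi for w, xi in zip(row, x)), 0.0) for row in W2]
    return sum(w * hi for w, hi in zip(W3, h))


def qnn(xq):
    f2 = F_H - F_W - F_IN         # F_i for i = 2 in Definition 2
    h = [clamp(rnd(2 ** f2 * sum(qw(w) * xi for w, xi in zip(row, xq))), 0, C_H_UB)
         for row in W2]
    return 2 ** (-F_W) * sum(qw(w) * hi for w, hi in zip(W3, h))


def max_quantization_error(center, r):
    """Largest |2^{-F_h} QNN(x') - DNN(x'/(ub-lb))| over the integer region."""
    ranges = [range(c - r, c + r + 1) for c in center]
    errs = []
    for xq in product(*ranges):
        x = [v / (C_IN_UB - C_IN_LB) for v in xq]
        errs.append(abs(2 ** (-F_H) * qnn(xq) - dnn(x)))
    return max(errs)


print(round(max_quantization_error((9, 6), 1), 3))  # 0.067
```

Under these assumptions the maximum is attained at the center input (9, 6), where the QNN yields 0.125 after precision alignment and the DNN yields 0.192.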


In this work, we focus on the quantization error bound verification problem 
for classification tasks. Specifically, for a classification task, we only focus on the 
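output difference of the predicted class instead of all the classes. Hence, given
a DNN $\mathcal{N}$, a corresponding QNN $\widehat{\mathcal{N}}$, a quantized input $\hat{\mathbf{x}}$ which is classified
to class $g$ by the DNN $\mathcal{N}$, a radius $r$ and an error bound $\epsilon$, the quantization
error bound property $P(\mathcal{N}, \widehat{\mathcal{N}}, \hat{\mathbf{x}}, r, \epsilon)$ for a classification task can be defined as
follows:

$$\bigwedge_{\hat{\mathbf{x}}' \in R(\hat{\mathbf{x}}, r)} \big( |2^{-F_h} \widehat{\mathcal{N}}(\hat{\mathbf{x}}')_g - \mathcal{N}(\mathbf{x}')_g| < \epsilon \big) \wedge \big( \mathbf{x}' = \hat{\mathbf{x}}' / (\mathcal{C}_{in}^{ub} - \mathcal{C}_{in}^{lb}) \big)$$

Note that $\mathcal{N}(\cdot)_g$ denotes the $g$-th entry of the vector $\mathcal{N}(\cdot)$.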


2.3 DEEPPOLY 


We briefly recap DEEPPOLY [55], which will be leveraged in this work for computing
the output of each neuron in a DNN.

The core idea of DEEPPOLY is to give each neuron an abstract domain in the
form of a linear combination of the variables preceding the neuron. To achieve
this, each hidden neuron $x^i_j$ (the $j$-th neuron in the $i$-th layer) in a DNN is
seen as two nodes $x^i_{j,0}$ and $x^i_{j,1}$, such that $x^i_{j,0} = \sum_{k=1}^{n_{i-1}} W^i_{j,k} x^{i-1}_{k,1} + b^i_j$ (affine
function) and $x^i_{j,1} = \text{ReLU}(x^i_{j,0})$ (ReLU function). Then, the affine function is
characterized as an abstract transformer using an upper polyhedral computation
and a lower polyhedral computation in terms of the variables $x^{i-1}_{k,1}$. Finally,
it recursively substitutes the variables in the upper and lower polyhedral computations
with the corresponding upper/lower polyhedral computations of the
variables until they only contain the input variables from which the concrete
intervals are computed.

Formally, the abstract element $A^i_{j,s}$ for the node $x^i_{j,s}$ ($s \in \{0, 1\}$) is a tuple
$A^i_{j,s} = \langle a^{i,\le}_{j,s}, a^{i,\ge}_{j,s}, l^i_{j,s}, u^i_{j,s} \rangle$, where $a^{i,\le}_{j,s}$ and $a^{i,\ge}_{j,s}$ are respectively the lower and
upper polyhedral computations in the form of a linear combination of the variables
$x^{i-1}_{k,1}$ if $s = 0$ or $x^i_{k,0}$ if $s = 1$, and $l^i_{j,s} \in \mathbb{R}$ and $u^i_{j,s} \in \mathbb{R}$ are the concrete
lower and upper bounds of the neuron. Then, the concretization of the abstract
element $A^i_{j,s}$ is $\Gamma(A^i_{j,s}) = \{x \in \mathbb{R} \mid a^{i,\le}_{j,s} \le x \wedge x \le a^{i,\ge}_{j,s}\}$.

Concretely, $a^{i,\le}_{j,0}$ and $a^{i,\ge}_{j,0}$ are defined as $a^{i,\le}_{j,0} = a^{i,\ge}_{j,0} = \sum_{k=1}^{n_{i-1}} W^i_{j,k} x^{i-1}_{k,1} + b^i_j$.
Furthermore, we can repeatedly substitute every variable in $a^{i,\le}_{j,0}$ (resp. $a^{i,\ge}_{j,0}$) with
its lower (resp. upper) polyhedral computation according to the coefficients until
no further substitution is possible. Then, we can get a sound lower (resp. upper)
bound in the form of a linear combination of the input variables based on which
$l^i_{j,0}$ (resp. $u^i_{j,0}$) can be computed immediately from the given input region.

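For the ReLU function $x^i_{j,1} = \text{ReLU}(x^i_{j,0})$, there are three cases to consider for
the abstract element $A^i_{j,1}$:

- If $u^i_{j,0} \le 0$, then $a^{i,\le}_{j,1} = a^{i,\ge}_{j,1} = 0$ and $l^i_{j,1} = u^i_{j,1} = 0$;

- If $l^i_{j,0} \ge 0$, then $a^{i,\le}_{j,1} = a^{i,\le}_{j,0}$, $a^{i,\ge}_{j,1} = a^{i,\ge}_{j,0}$, $l^i_{j,1} = l^i_{j,0}$ and $u^i_{j,1} = u^i_{j,0}$;

- If $l^i_{j,0} < 0 \wedge u^i_{j,0} > 0$, then $a^{i,\ge}_{j,1} = \frac{u^i_{j,0}(x^i_{j,0} - l^i_{j,0})}{u^i_{j,0} - l^i_{j,0}}$ and $a^{i,\le}_{j,1} = \lambda x^i_{j,0}$, where $\lambda \in \{0, 1\}$
is chosen such that the area of the resulting shape by $a^{i,\le}_{j,1}$ and $a^{i,\ge}_{j,1}$ is minimal, $l^i_{j,1} = \lambda l^i_{j,0}$
and $u^i_{j,1} = u^i_{j,0}$.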

Note that DEEPPOLY also introduces transformers for other functions, such 
as sigmoid, tanh, and maxpool functions. In this work, we only consider DNNs 
with only ReLU as non-linear operators. 
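
The three ReLU cases recapped above can be sketched as a small Python function. This is a simplified illustration, not the DEEPPOLY implementation: the function name `relu_transformer` is hypothetical, and the symbolic bounds are represented only as `(slope, intercept)` pairs relative to the pre-activation node rather than as full linear expressions over earlier layers. The choice of λ uses the usual area comparison (λ = 1 when u > −l), which follows from comparing the two triangle areas.

```python
def relu_transformer(l, u):
    """Abstract ReLU transformer on a pre-activation interval [l, u].

    Returns (lower, upper, l_out, u_out), where lower and upper are
    (slope, intercept) pairs describing linear bounds in terms of the
    pre-activation variable.
    """
    if u <= 0:                       # always inactive: output is exactly 0
        return (0.0, 0.0), (0.0, 0.0), 0.0, 0.0
    if l >= 0:                       # always active: output equals input
        return (1.0, 0.0), (1.0, 0.0), l, u
    # Unstable neuron: the upper bound is the line through (l, 0) and (u, u).
    slope = u / (u - l)
    # Lower bound lam * x with lam in {0, 1}; pick the smaller-area option.
    lam = 1.0 if u > -l else 0.0
    return (lam, 0.0), (slope, -slope * l), lam * l, u
```

For example, on the interval [−1, 3] the sketch selects λ = 1 and the upper line 0.75·x + 0.75, whereas on [−3, 1] it selects λ = 0.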


3 Methodology of QEBVerif 


In this section, we first give an overview of our quantization error bound verifi- 
cation method, QEBVerif, and then give the detailed design of each component. 


3.1 Overview of QEBVerif 


An overview of QEBVerif is shown in Fig. 2. Given a DNN $\mathcal{N}$, its QNN $\widehat{\mathcal{N}}$, a
quantization error bound $\epsilon$ and an input region consisting of a quantized input $\hat{\mathbf{x}}$
and a radius $r$, to verify the quantization error bound property $P(\mathcal{N}, \widehat{\mathcal{N}}, \hat{\mathbf{x}}, r, \epsilon)$,
QEBVerif first performs a differential reachability analysis (DRA) to compute
a sound output difference interval for the two networks. Note that the difference
intervals of all the neurons are also recorded for later use. If the output
difference interval of the two networks is contained in $[-\epsilon, \epsilon]$, then the property
is proved and QEBVerif outputs “Proved”. Otherwise, QEBVerif leverages
our MILP-based quantization error bound verification method by encoding the 
problem into an equivalent mixed integer linear programming (MILP) problem 
which can be solved by off-the-shelf solvers. To reduce the size of mixed inte- 
ger linear constraints and boost the verification, QEBVerif independently applies 
symbolic interval analysis on the two networks based on which some activation 
patterns could be omitted. We further encode the difference intervals of all the 
neurons from DRA as mixed integer linear constraints and add them to the MILP 
problem. Though it increases the number of mixed integer linear constraints, it 
is very helpful for solving hard verification tasks. Therefore, the whole verifi- 
cation process is sound, complete yet reasonably efficient. We remark that the 
MILP-based verification method is often more time-consuming and thus the first 
step allows us to quickly verify many tasks first. 


Fig. 2. An overview of QEBVerif. 


3.2 Differential Reachability Analysis 


Naively, one could use an existing verification tool in the literature to indepen- 
dently compute the output intervals for both the QNN and the DNN, and then 
compute their output difference directly by interval subtraction. However, such 
an approach would be ineffective due to the significant precision loss. 

Recently, Paulsen et al. [48] proposed RELUDIFF and showed that the accu- 
racy of output difference for two DNNs can be greatly improved by propagating 
the difference intervals layer-by-layer. For each hidden layer, they first compute 
the output difference of affine functions (before applying the ReLU), and then 
they use a ReLU transformer to compute the output difference after applying 
the ReLU functions. The reason why RELUDIFF outperforms the naive method 
is that RELUDIFF first computes part of the difference before it accumulates. 
RELUDIFF is later improved to tighten the approximated difference intervals [49]. 
However, as mentioned previously, they do not support fully quantized neural
networks. Inspired by their work, we design a difference propagation algorithm
for our setting. We use $S^{in}(x^i_j)$ (resp. $S^{in}(\hat{x}^i_j)$) to denote the interval of the $j$-th
neuron in the $i$-th layer in the DNN (resp. QNN) before applying the ReLU function
(resp. clamp function), and use $S(x^i_j)$ (resp. $S(\hat{x}^i_j)$) to denote the output
interval after applying the ReLU function (resp. clamp function). We use $\delta^{in}_i$
(resp. $\delta_i$) to denote the difference interval for the $i$-th layer before (resp. after)
applying the activation functions, and use $\delta^{in}_{i,j}$ (resp. $\delta_{i,j}$) to denote the interval
for the $j$-th neuron of the $i$-th layer. We denote by LB(·) and UB(·) the concrete
lower and upper bounds accordingly.

Based on the above notations, we give our difference propagation in Algorithm 1.
It works as follows. Given a DNN $\mathcal{N}$, a QNN $\widehat{\mathcal{N}}$ and a quantized input
region $R(\hat{\mathbf{x}}, r)$, we first compute intervals $S^{in}(x^i_j)$ and $S(x^i_j)$ for neurons in $\mathcal{N}$
using symbolic interval analysis DEEPPOLY, and compute intervals $S^{in}(\hat{x}^i_j)$ and
$S(\hat{x}^i_j)$ for neurons in $\widehat{\mathcal{N}}$ using the concrete interval analysis method [22]. Remark that
no symbolic interval analysis for QNNs exists. By Definition 3, for each quantized
input $\hat{\mathbf{x}}'$ for the QNN, we obtain the input for the DNN as $\mathbf{x}' = \hat{\mathbf{x}}' / (\mathcal{C}_{in}^{ub} - \mathcal{C}_{in}^{lb})$.
After precision alignment, we get the input difference as
$2^{-F_{in}} \hat{\mathbf{x}}' - \mathbf{x}' = (2^{-F_{in}} - 1/(\mathcal{C}_{in}^{ub} - \mathcal{C}_{in}^{lb})) \hat{\mathbf{x}}'$.
Hence, given an input region, we get the output difference
of the input layer: $\delta_1 = (2^{-F_{in}} - 1/(\mathcal{C}_{in}^{ub} - \mathcal{C}_{in}^{lb})) S(\hat{\mathbf{x}}^1)$. Then, we compute
the output difference $\delta_i$ of each hidden layer iteratively by applying the affine
transformer and activation transformer given in Algorithm 2 and Algorithm 3.
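Algorithm 1: Forward Difference Propagation

Input : DNN $\mathcal{N}$, QNN $\widehat{\mathcal{N}}$, input region $R(\hat{\mathbf{x}}, r)$
Output: Output difference interval $\delta$

1  Compute $S^{in}(x^i_j)$ and $S(x^i_j)$ for $i \in [d-1]$, $j \in [n_i]$ using DEEPPOLY;
2  Compute $S^{in}(\hat{x}^i_j)$ and $S(\hat{x}^i_j)$ for $i \in [d-1]$, $j \in [n_i]$ by applying interval analysis [22];
3  Initialize the difference: $\delta_1 = (2^{-F_{in}} - 1/(\mathcal{C}_{in}^{ub} - \mathcal{C}_{in}^{lb})) S(\hat{\mathbf{x}}^1)$;
4  for $i$ in $2, \ldots, d-1$ do   // propagate in hidden layers
5      for $j$ in $1, \ldots, n_i$ do
6          $\Delta b^i_j = 2^{-F_b} \hat{b}^i_j - b^i_j$; $\xi = 2^{-F_h - 1}$;
7          $\delta^{in}_{i,j}$ = AFFTRS($\mathbf{W}^i_{j,:}$, $2^{-F_w} \widehat{\mathbf{W}}^i_{j,:}$, $\Delta b^i_j$, $S(\mathbf{x}^{i-1})$, $\delta_{i-1}$, $\xi$);
8          $\delta_{i,j}$ = ACTTRS($\delta^{in}_{i,j}$, $S^{in}(x^i_j)$, $2^{-F_h} S^{in}(\hat{x}^i_j)$);
9  // propagate in the output layer
10 for $j$ in $1, \ldots, n_d$ do
11     $\Delta b^d_j = 2^{-F_b} \hat{b}^d_j - b^d_j$;
12     $\delta_{d,j} = \delta^{in}_{d,j}$ = AFFTRS($\mathbf{W}^d_{j,:}$, $2^{-F_w} \widehat{\mathbf{W}}^d_{j,:}$, $\Delta b^d_j$, $S(\mathbf{x}^{d-1})$, $\delta_{d-1}$, 0);
13 return $(\delta_{i,j})_{2 \le i \le d,\, 1 \le j \le n_d}$;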


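Algorithm 2: AFFTRS Function

Input : Weight vector $\mathbf{W}^i_{j,:}$, weight vector $\widetilde{\mathbf{W}}^i_{j,:}$, bias difference $\Delta b^i_j$, neuron interval
        $S(\mathbf{x}^{i-1})$, difference interval $\delta_{i-1}$, rounding error $\xi$
Output: Difference interval $\delta^{in}_{i,j}$
1  lb = LB($\widetilde{\mathbf{W}}^i_{j,:} \delta_{i-1} + (\widetilde{\mathbf{W}}^i_{j,:} - \mathbf{W}^i_{j,:}) S(\mathbf{x}^{i-1})$) + $\Delta b^i_j - \xi$;
2  ub = UB($\widetilde{\mathbf{W}}^i_{j,:} \delta_{i-1} + (\widetilde{\mathbf{W}}^i_{j,:} - \mathbf{W}^i_{j,:}) S(\mathbf{x}^{i-1})$) + $\Delta b^i_j + \xi$;
3  return [lb, ub];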


Finally, we get the output difference for the output layer using only the affine 
transformer. 

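Affine Transformer. The difference before applying the activation function for
the $j$-th neuron in the $i$-th layer is:
$\delta^{in}_{i,j} = 2^{-F_h} \lfloor 2^{F_i} \widehat{\mathbf{W}}^i_{j,:} S(\hat{\mathbf{x}}^{i-1}) + 2^{F_h - F_b} \hat{b}^i_j \rceil - \mathbf{W}^i_{j,:} S(\mathbf{x}^{i-1}) - b^i_j$,
where $2^{-F_h}$ is used to align the precision between the outputs of
the two networks (cf. Sect. 2). Then, we soundly remove the rounding operators
and give constraints for upper/lower bounds of $\delta^{in}_{i,j}$ as follows:

$$UB(\delta^{in}_{i,j}) \le UB\big(2^{-F_h} (2^{F_i} \widehat{\mathbf{W}}^i_{j,:} S(\hat{\mathbf{x}}^{i-1}) + 2^{F_h - F_b} \hat{b}^i_j + 0.5) - \mathbf{W}^i_{j,:} S(\mathbf{x}^{i-1}) - b^i_j\big)$$
$$LB(\delta^{in}_{i,j}) \ge LB\big(2^{-F_h} (2^{F_i} \widehat{\mathbf{W}}^i_{j,:} S(\hat{\mathbf{x}}^{i-1}) + 2^{F_h - F_b} \hat{b}^i_j - 0.5) - \mathbf{W}^i_{j,:} S(\mathbf{x}^{i-1}) - b^i_j\big)$$

Finally, we have $UB(\delta^{in}_{i,j}) \le UB(\widetilde{\mathbf{W}}^i_{j,:} S(\tilde{\mathbf{x}}^{i-1}) - \mathbf{W}^i_{j,:} S(\mathbf{x}^{i-1})) + \Delta b^i_j + \xi$ and
$LB(\delta^{in}_{i,j}) \ge LB(\widetilde{\mathbf{W}}^i_{j,:} S(\tilde{\mathbf{x}}^{i-1}) - \mathbf{W}^i_{j,:} S(\mathbf{x}^{i-1})) + \Delta b^i_j - \xi$, which can be further
reformulated as follows:

$$UB(\delta^{in}_{i,j}) \le UB(\widetilde{\mathbf{W}}^i_{j,:} \delta_{i-1} + \Delta \mathbf{W}^i_{j,:} S(\mathbf{x}^{i-1})) + \Delta b^i_j + \xi$$
$$LB(\delta^{in}_{i,j}) \ge LB(\widetilde{\mathbf{W}}^i_{j,:} \delta_{i-1} + \Delta \mathbf{W}^i_{j,:} S(\mathbf{x}^{i-1})) + \Delta b^i_j - \xi$$

where $S(\tilde{\mathbf{x}}^{i-1}) = 2^{-F_{in}} S(\hat{\mathbf{x}}^{i-1})$ if $i = 2$, and $2^{-F_h} S(\hat{\mathbf{x}}^{i-1})$ otherwise, $\widetilde{\mathbf{W}}^i_{j,:} =
2^{-F_w} \widehat{\mathbf{W}}^i_{j,:}$, $\Delta \mathbf{W}^i_{j,:} = \widetilde{\mathbf{W}}^i_{j,:} - \mathbf{W}^i_{j,:}$, $\Delta b^i_j = 2^{-F_b} \hat{b}^i_j - b^i_j$ and $\xi = 2^{-F_h - 1}$.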
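
The interval-based affine transformer of Algorithm 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: intervals are represented as `(lo, hi)` tuples, and the names `scale` and `aff_trs` are hypothetical rather than taken from the tool.

```python
def scale(c, iv):
    """Multiply the interval iv = (lo, hi) by the scalar c."""
    lo, hi = c * iv[0], c * iv[1]
    return (min(lo, hi), max(lo, hi))


def aff_trs(w, w_tilde, delta_b, s_prev, delta_prev, xi):
    """Interval-based affine transformer for one neuron.

    w, w_tilde   : real-valued and fixed-point weight rows
    delta_b      : bias difference
    s_prev       : output intervals of the previous DNN layer
    delta_prev   : difference intervals of the previous layer
    xi           : rounding error (0 for the output layer)
    """
    lb = ub = 0.0
    for wk, wtk, s, d in zip(w, w_tilde, s_prev, delta_prev):
        # w_tilde * delta + (w_tilde - w) * S(x), summed term by term.
        for lo, hi in (scale(wtk, d), scale(wtk - wk, s)):
            lb += lo
            ub += hi
    return (lb + delta_b - xi, ub + delta_b + xi)
```

Called with the first hidden neuron of Example 2, i.e., `aff_trs([1.2, -0.2], [1.25, -0.25], 0.0, [(0.4, 0.8), (0.2, 0.6)], [(-0.05, -0.025), (-0.0375, -0.0125)], 0.125)`, the sketch returns the interval [−0.194375, 0.133125] up to floating-point error.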


QEBVerif: Quantization Error Bound Verification of Neural Networks 423 


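Algorithm 3: ACTTRS Function

Input : Difference interval $\delta^{in}_{i,j}$, neuron interval $S^{in}(x^i_j)$, neuron interval $S^{in}(\tilde{x}^i_j)$, clamp
        upper bound $t$
Output: Difference interval $\delta_{i,j}$
1  if UB($S^{in}(x^i_j)$) ≤ 0 then lb = clamp(LB($S^{in}(\tilde{x}^i_j)$), 0, t); ub = clamp(UB($S^{in}(\tilde{x}^i_j)$), 0, t);
2  else if LB($S^{in}(x^i_j)$) ≥ 0 then
3      if UB($S^{in}(\tilde{x}^i_j)$) ≤ t and LB($S^{in}(\tilde{x}^i_j)$) ≥ 0 then lb = LB($\delta^{in}_{i,j}$); ub = UB($\delta^{in}_{i,j}$);
4      else if LB($S^{in}(\tilde{x}^i_j)$) ≥ t or UB($S^{in}(\tilde{x}^i_j)$) ≤ 0 then
5          lb = clamp(LB($S^{in}(\tilde{x}^i_j)$), 0, t) − UB($S^{in}(x^i_j)$);
6          ub = clamp(UB($S^{in}(\tilde{x}^i_j)$), 0, t) − LB($S^{in}(x^i_j)$);
7      else if UB($S^{in}(\tilde{x}^i_j)$) ≤ t then
8          lb = max(−UB($S^{in}(x^i_j)$), LB($\delta^{in}_{i,j}$)); ub = max(−LB($S^{in}(x^i_j)$), UB($\delta^{in}_{i,j}$));
9      else if LB($S^{in}(\tilde{x}^i_j)$) ≥ 0 then
10         lb = min(t − UB($S^{in}(x^i_j)$), LB($\delta^{in}_{i,j}$)); ub = min(t − LB($S^{in}(x^i_j)$), UB($\delta^{in}_{i,j}$));
11     else
12         lb = max(−UB($S^{in}(x^i_j)$), min(t − UB($S^{in}(x^i_j)$), LB($\delta^{in}_{i,j}$)));
13         ub = max(−LB($S^{in}(x^i_j)$), min(t − LB($S^{in}(x^i_j)$), UB($\delta^{in}_{i,j}$)));
14 else
15     if UB($S^{in}(\tilde{x}^i_j)$) ≤ t and LB($S^{in}(\tilde{x}^i_j)$) ≥ 0 then
16         lb = min(LB($S^{in}(\tilde{x}^i_j)$), LB($\delta^{in}_{i,j}$)); ub = min(UB($S^{in}(\tilde{x}^i_j)$), UB($\delta^{in}_{i,j}$));
17     else if LB($S^{in}(\tilde{x}^i_j)$) ≥ t or UB($S^{in}(\tilde{x}^i_j)$) ≤ 0 then
18         lb = clamp(LB($S^{in}(\tilde{x}^i_j)$), 0, t) − UB($S^{in}(x^i_j)$); ub = clamp(UB($S^{in}(\tilde{x}^i_j)$), 0, t);
19     else if UB($S^{in}(\tilde{x}^i_j)$) ≤ t then
20         lb = max(LB($\delta^{in}_{i,j}$), −UB($S^{in}(x^i_j)$)); ub = min(UB($\delta^{in}_{i,j}$), UB($S^{in}(\tilde{x}^i_j)$));
21         if UB($\delta^{in}_{i,j}$) ≤ 0 then ub = 0;
22         if LB($\delta^{in}_{i,j}$) ≥ 0 then lb = 0;
23     else if LB($S^{in}(\tilde{x}^i_j)$) ≥ 0 then
24         lb = min(LB($\delta^{in}_{i,j}$), LB($S^{in}(\tilde{x}^i_j)$), t − UB($S^{in}(x^i_j)$)); ub = min(UB($\delta^{in}_{i,j}$), t);
25     else
26         lb = min(t − UB($S^{in}(x^i_j)$), 0, max(LB($\delta^{in}_{i,j}$), −UB($S^{in}(x^i_j)$)));
27         ub = clamp(UB($\delta^{in}_{i,j}$), 0, t);
28 return [lb, ub] ∩ (($S^{in}(\tilde{x}^i_j)$ ∩ [0, t]) − ($S^{in}(x^i_j)$ ∩ [0, +∞)));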


Activation Transformer. Now we give our activation transformer in Algorithm 3,
which computes the difference interval $\delta_{i,j}$ from the difference interval
$\delta^{in}_{i,j}$. Note that the neuron interval $S^{in}(\hat{x}^i_j)$ for the QNN has already been converted
to the fixed-point counterpart $S^{in}(\tilde{x}^i_j) = 2^{-F_h} S^{in}(\hat{x}^i_j)$ as an input parameter,
as well as the clamping upper bound ($t = 2^{-F_h} \mathcal{C}_h^{ub}$). Different from RELUDIFF
[48] which focuses on the subtraction of two ReLU functions, here we investigate
the subtraction of the clamping function and ReLU function.
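
The case analysis of Algorithm 3 can be sketched in Python as below. This is an illustrative reading of the algorithm rather than the tool's code: the name `act_trs` and the tuple representation of intervals are hypothetical, the comparisons are taken as non-strict, and the final intersection is interpreted as intersecting with the plain interval difference of the two clamped post-activation ranges, which is an assumption of this sketch.

```python
def clamp(v, lo, hi):
    return max(lo, min(v, hi))


def act_trs(d_in, s, sq, t):
    """Difference interval after ReLU (DNN) versus clamp to [0, t] (QNN).

    d_in : difference interval before activation, (lo, hi)
    s    : DNN pre-activation interval
    sq   : QNN pre-activation interval, already in fixed point
    t    : clamping upper bound in fixed point
    """
    dl, du = d_in
    l, u = s
    ql, qu = sq
    if u <= 0:                                   # DNN neuron always inactive
        lb, ub = clamp(ql, 0, t), clamp(qu, 0, t)
    elif l >= 0:                                 # DNN neuron always active
        if qu <= t and ql >= 0:
            lb, ub = dl, du
        elif ql >= t or qu <= 0:
            lb, ub = clamp(ql, 0, t) - u, clamp(qu, 0, t) - l
        elif qu <= t:
            lb, ub = max(-u, dl), max(-l, du)
        elif ql >= 0:
            lb, ub = min(t - u, dl), min(t - l, du)
        else:
            lb = max(-u, min(t - u, dl))
            ub = max(-l, min(t - l, du))
    else:                                        # DNN neuron may be either
        if qu <= t and ql >= 0:
            lb, ub = min(ql, dl), min(qu, du)
        elif ql >= t or qu <= 0:
            lb, ub = clamp(ql, 0, t) - u, clamp(qu, 0, t)
        elif qu <= t:
            lb, ub = max(dl, -u), min(du, qu)
            if du <= 0:
                ub = 0
            if dl >= 0:
                lb = 0
        elif ql >= 0:
            lb, ub = min(dl, ql, t - u), min(du, t)
        else:
            lb = min(t - u, 0, max(dl, -u))
            ub = clamp(du, 0, t)
    # Intersect with the naive difference of the post-activation ranges.
    out_q = (clamp(ql, 0, t), clamp(qu, 0, t))
    out_d = (max(l, 0), max(u, 0))
    return (max(lb, out_q[0] - out_d[1]), min(ub, out_q[1] - out_d[0]))
```

With the second hidden neuron of Example 2, i.e., `act_trs((-0.204375, 0.123125), (-0.4, 0.2), (-0.5, 0.25), 3.75)`, the sketch yields [−0.2, 0.123125].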


Theorem 1. If $\tau_h = +$, then Algorithm 1 is sound.


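Example 2. We exemplify Algorithm 1 using the networks $\mathcal{N}_e$ and $\widehat{\mathcal{N}}_e$ shown
in Fig. 1. Given quantized input region $R((9,6),3)$ and the corresponding real-valued
input region $R((0.6, 0.4), 0.2)$, we have $S(\hat{x}^1_1) = [6, 12]$ and $S(\hat{x}^1_2) = [3, 9]$.

First, we get $S^{in}(x^2_1) = S(x^2_1) = [0.36, 0.92]$, $S^{in}(x^2_2) = [-0.4, 0.2]$, $S(x^2_2) =
[0, 0.2]$ based on DEEPPOLY and $S^{in}(\hat{x}^2_1) = S(\hat{x}^2_1) = [1, 4]$, $S^{in}(\hat{x}^2_2) = [-2, 1]$,
$S(\hat{x}^2_2) = [0, 1]$ via interval analysis: $LB(S^{in}(\hat{x}^2_1)) = \lfloor 2^{-4}(5 LB(\hat{x}^1_1) - UB(\hat{x}^1_2)) \rceil =
1$, $UB(S^{in}(\hat{x}^2_1)) = \lfloor 2^{-4}(5 UB(\hat{x}^1_1) - LB(\hat{x}^1_2)) \rceil = 4$, $LB(S^{in}(\hat{x}^2_2)) = \lfloor 2^{-4}(-3 UB(\hat{x}^1_1) +
3 LB(\hat{x}^1_2)) \rceil = -2$, and $UB(S^{in}(\hat{x}^2_2)) = \lfloor 2^{-4}(-3 LB(\hat{x}^1_1) + 3 UB(\hat{x}^1_2)) \rceil = 1$.
By Line 3 in Algorithm 1, we have $\delta_{1,1} = (2^{-4} - \frac{1}{15}) S(\hat{x}^1_1) = -\frac{1}{240} S(\hat{x}^1_1) = [-0.05, -0.025]$ and
$\delta_{1,2} = -\frac{1}{240} S(\hat{x}^1_2) = [-0.0375, -0.0125]$.

Then, we compute the difference interval before the activation functions.
The rounding error is $\xi = 2^{-F_h - 1} = 0.125$. We obtain the difference intervals
$\delta^{in}_{2,1} = [-0.194375, 0.133125]$ and $\delta^{in}_{2,2} = [-0.204375, 0.123125]$ as follows based
on Algorithm 2:

- $LB(\delta^{in}_{2,1}) = LB(\widetilde{W}^2_{1,1} \delta_{1,1} + \widetilde{W}^2_{1,2} \delta_{1,2} + \Delta W^2_{1,1} S(x^1_1) + \Delta W^2_{1,2} S(x^1_2)) - \xi =
1.25 \times LB(\delta_{1,1}) - 0.25 \times UB(\delta_{1,2}) + (1.25 - 1.2) \times LB(S(x^1_1)) + (-0.25 + 0.2) \times
UB(S(x^1_2)) - 0.125$, $UB(\delta^{in}_{2,1}) = UB(\widetilde{W}^2_{1,1} \delta_{1,1} + \widetilde{W}^2_{1,2} \delta_{1,2} + \Delta W^2_{1,1} S(x^1_1) +
\Delta W^2_{1,2} S(x^1_2)) + \xi = 1.25 \times UB(\delta_{1,1}) - 0.25 \times LB(\delta_{1,2}) + (1.25 - 1.2) \times
UB(S(x^1_1)) + (-0.25 + 0.2) \times LB(S(x^1_2)) + 0.125$;

- $LB(\delta^{in}_{2,2}) = LB(\widetilde{W}^2_{2,1} \delta_{1,1} + \widetilde{W}^2_{2,2} \delta_{1,2} + \Delta W^2_{2,1} S(x^1_1) + \Delta W^2_{2,2} S(x^1_2)) - \xi =
-0.75 \times UB(\delta_{1,1}) + 0.75 \times LB(\delta_{1,2}) + (-0.75 + 0.7) \times UB(S(x^1_1)) + (0.75 -
0.8) \times UB(S(x^1_2)) - 0.125$, $UB(\delta^{in}_{2,2}) = UB(\widetilde{W}^2_{2,1} \delta_{1,1} + \widetilde{W}^2_{2,2} \delta_{1,2} + \Delta W^2_{2,1} S(x^1_1) +
\Delta W^2_{2,2} S(x^1_2)) + \xi = -0.75 \times LB(\delta_{1,1}) + 0.75 \times UB(\delta_{1,2}) + (-0.75 + 0.7) \times
LB(S(x^1_1)) + (0.75 - 0.8) \times LB(S(x^1_2)) + 0.125$.

By Lines 20~22 in Algorithm 3, we get the difference intervals after the activation
functions for the hidden layer as: $\delta_{2,1} = \delta^{in}_{2,1} = [-0.194375, 0.133125]$,
$\delta_{2,2} = [\max(LB(\delta^{in}_{2,2}), -UB(S^{in}(x^2_2))), \min(UB(\delta^{in}_{2,2}), UB(S^{in}(\tilde{x}^2_2)))] =
[-0.2, 0.123125]$.

Next, we compute the output difference interval of the networks using
Algorithm 2 again but with $\xi = 0$: $LB(\delta^{in}_{3,1}) = LB(\widetilde{W}^3_{1,1} \delta_{2,1} + \widetilde{W}^3_{1,2} \delta_{2,2} +
\Delta W^3_{1,1} S(x^2_1) + \Delta W^3_{1,2} S(x^2_2)) = 0.25 \times LB(\delta_{2,1}) + 0.75 \times LB(\delta_{2,2}) + (0.25 - 0.3) \times
UB(S(x^2_1)) + (0.75 - 0.7) \times LB(S(x^2_2))$, $UB(\delta^{in}_{3,1}) = UB(\widetilde{W}^3_{1,1} \delta_{2,1} + \widetilde{W}^3_{1,2} \delta_{2,2} +
\Delta W^3_{1,1} S(x^2_1) + \Delta W^3_{1,2} S(x^2_2)) = 0.25 \times UB(\delta_{2,1}) + 0.75 \times UB(\delta_{2,2}) + (0.25 - 0.3) \times
LB(S(x^2_1)) + (0.75 - 0.7) \times UB(S(x^2_2))$. Finally, the quantization error interval is
$[-0.24459375, 0.117625]$.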


3.3 MILP Encoding of the Verification Problem 


If DRA fails to prove the property, we encode the problem as an equivalent 
MILP problem. Specifically, we encode both the QNN and DNN as sets of (mixed 
integer) linear constraints, and quantize the input region as a set of integer linear 
constraints. We adopt the MILP encodings of DNNs [39] and QNNs [40] to 
transform the DNN and QNN into a set of linear constraints. We use (symbolic) 
intervals to further reduce the size of linear constraints similar to [39] while [40] 
did not. We suppose that the sets of constraints encoding the QNN, DNN, and 
quantized input region are O¢, Oy, and Op, respectively. Next, we give the 
MILP encoding of the robust error bound property. 

Recall that, given a DNN $N$, an input region $R(\hat{x}, r)$ such that $x$ is classified
to class $g$ by $N$, a QNN $\widehat{N}$ has a quantization error bound $\epsilon$ w.r.t. $R(\hat{x}, r)$ if
for every $\hat{x}' \in R(\hat{x}, r)$, we have $|2^{-F_h}\widehat{N}(\hat{x}')_g - N(x')_g| < \epsilon$. Thus, it suffices to
check if $|2^{-F_h}\widehat{N}(\hat{x}')_g - N(x')_g| \geq \epsilon$ for some $\hat{x}' \in R(\hat{x}, r)$.

Let $\hat{x}^d_g$ (resp. $x^d_g$) be the $g$-th output of $\widehat{N}$ (resp. $N$). We introduce a real-
valued variable $\eta$ and a Boolean variable $v$ such that $\eta = \max(2^{-F_h}\hat{x}^d_g - x^d_g, 0)$
can be encoded by the set $\Theta_g$ of constraints with an extremely large number $\mathbf{M}$:

$\Theta_g = \{\eta \geq 0,\ \eta \geq 2^{-F_h}\hat{x}^d_g - x^d_g,\ \eta \leq \mathbf{M} \cdot v,\ \eta \leq 2^{-F_h}\hat{x}^d_g - x^d_g + \mathbf{M} \cdot (1 - v)\}$.

As a result, $|2^{-F_h}\hat{x}^d_g - x^d_g| \geq \epsilon$ iff the set of linear constraints
$\Theta_\epsilon = \Theta_g \cup \{2\eta - (2^{-F_h}\hat{x}^d_g - x^d_g) \geq \epsilon\}$ holds.

Finally, the quantization error bound verification problem is equivalent to
solving the constraints: $\Theta_P = \Theta_{\widehat{N}} \cup \Theta_{N} \cup \Theta_{R} \cup \Theta_\epsilon$. Note that the output
difference intervals of hidden neurons obtained from Algorithm 1 can be encoded
as linear constraints which are added into the set $\Theta_P$ to boost the solving.
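
The big-M encoding of the max term can be sanity-checked without a solver. The sketch below is only an illustration under stated assumptions: `a` stands for the scaled output difference, the helper names `eta_solutions` and `violates_bound` are hypothetical, and `big_m` is assumed to be larger than |a|. It enumerates both values of the Boolean variable and shows that the four constraints leave exactly one feasible value, max(a, 0), so that 2*eta - a equals |a|.

```python
def eta_solutions(a, big_m=1e6):
    """For each v in {0, 1}, return the feasible interval of eta (if any).

    Constraints: eta >= 0, eta >= a, eta <= M*v, eta <= a + M*(1 - v).
    Assumes |a| <= big_m (hypothetical choice of the large constant).
    """
    feasible = {}
    for v in (0, 1):
        lower = max(0.0, a)                          # eta >= 0 and eta >= a
        upper = min(big_m * v, a + big_m * (1 - v))  # the two upper bounds
        if lower <= upper:
            feasible[v] = (lower, upper)
    return feasible


def violates_bound(a, eps, big_m=1e6):
    """True iff some feasible eta satisfies 2*eta - a >= eps, i.e. |a| >= eps."""
    return any(2 * hi - a >= eps for _, hi in eta_solutions(a, big_m).values())
```

For a positive difference only v = 1 is feasible and eta collapses to `a`; for a negative difference only v = 0 is feasible and eta collapses to 0.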


4 An Abstract Domain for Symbolic-Based DRA 


While Algorithm 1 can compute difference intervals, the affine transformer 
explicitly adds a concrete rounding error interval to each neuron, which accu- 
mulates into a significant precision loss over the subsequent layers. To alleviate 
this problem, we introduce an abstract domain based on DEEPPOLY which helps 
to compute sound symbolic approximations for the lower and upper bounds of 
each difference interval, hence computing tighter difference intervals. 


4.1 An Abstract Domain for QNNs 


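We first introduce transformers for affine transforms with rounding operators
and clamp functions in QNNs. Recall that the activation function in a QNN
$\widehat{N}$ is also a min-ReLU function: $\min(\mathrm{ReLU}(\lfloor \cdot \rceil), C^{ub}_h)$. Thus, we regard each
hidden neuron $\hat{x}^i_j$ in a QNN as three nodes $\hat{x}^i_{j,0}$, $\hat{x}^i_{j,1}$, and $\hat{x}^i_{j,2}$ such that
$\hat{x}^i_{j,0} = \lfloor 2^{F_i} \sum_k \widehat{W}^i_{j,k} \hat{x}^{i-1}_{k,2} + 2^{F_h - F_b} \hat{b}^i_j \rceil$ (affine function),
$\hat{x}^i_{j,1} = \max(\hat{x}^i_{j,0}, 0)$ (ReLU function) and
$\hat{x}^i_{j,2} = \min(\hat{x}^i_{j,1}, C^{ub}_h)$ (min function). We now give the abstract
domain $\widehat{A}^i_{j,p} = \langle \hat{a}^{i,\leq}_{j,p}, \hat{a}^{i,\geq}_{j,p}, \hat{l}^i_{j,p}, \hat{u}^i_{j,p} \rangle$ for each neuron $\hat{x}^i_{j,p}$ ($p \in \{0, 1, 2\}$) in a QNN
as follows.

Following DEEPPOLY, $\hat{a}^{i,\leq}_{j,0}$ and $\hat{a}^{i,\geq}_{j,0}$ for the affine function of $\hat{x}^i_{j,0}$ with
rounding operators are defined as
$\hat{a}^{i,\leq}_{j,0} = 2^{F_i} \sum_k \widehat{W}^i_{j,k} \hat{x}^{i-1}_{k,2} + 2^{F_h - F_b} \hat{b}^i_j - 0.5$ and
$\hat{a}^{i,\geq}_{j,0} = 2^{F_i} \sum_k \widehat{W}^i_{j,k} \hat{x}^{i-1}_{k,2} + 2^{F_h - F_b} \hat{b}^i_j + 0.5$.
We remark that +0.5 and -0.5 here are added to soundly encode the rounding
operators and have no effect on the preservation of the invariant, since the
rounding operators will add/subtract 0.5 at most to round each floating-point
number into its nearest integer. The abstract transformer for the ReLU function
$\hat{x}^i_{j,1} = \mathrm{ReLU}(\hat{x}^i_{j,0})$ is defined the same as DEEPPOLY.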

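For the min function $\hat{x}^i_{j,2} = \min(\hat{x}^i_{j,1}, C^{ub}_h)$, there are three cases for $\widehat{A}^i_{j,2}$:

- If $\hat{l}^i_{j,1} \geq C^{ub}_h$, then $\hat{a}^{i,\leq}_{j,2} = \hat{a}^{i,\geq}_{j,2} = C^{ub}_h$ and $\hat{l}^i_{j,2} = \hat{u}^i_{j,2} = C^{ub}_h$;
- If $\hat{u}^i_{j,1} \leq C^{ub}_h$, then $\hat{a}^{i,\leq}_{j,2} = \hat{a}^{i,\leq}_{j,1}$, $\hat{a}^{i,\geq}_{j,2} = \hat{a}^{i,\geq}_{j,1}$, $\hat{l}^i_{j,2} = \hat{l}^i_{j,1}$ and $\hat{u}^i_{j,2} = \hat{u}^i_{j,1}$;
- If $\hat{l}^i_{j,1} < C^{ub}_h \wedge \hat{u}^i_{j,1} > C^{ub}_h$, then $\hat{a}^{i,\geq}_{j,2} = \lambda \hat{x}^i_{j,1} + \mu$ and
  $\hat{a}^{i,\leq}_{j,2} = \frac{C^{ub}_h - \hat{l}^i_{j,1}}{\hat{u}^i_{j,1} - \hat{l}^i_{j,1}} \hat{x}^i_{j,1} + \frac{\hat{u}^i_{j,1} - C^{ub}_h}{\hat{u}^i_{j,1} - \hat{l}^i_{j,1}} \hat{l}^i_{j,1}$,
  where $(\lambda, \mu) \in \{(0, C^{ub}_h), (1, 0)\}$ such that the area of the resulting
  shape by $\hat{a}^{i,\leq}_{j,2}$ and $\hat{a}^{i,\geq}_{j,2}$ is minimal, $\hat{l}^i_{j,2} = \hat{l}^i_{j,1}$, and $\hat{u}^i_{j,2} = \lambda \hat{u}^i_{j,1} + \mu$. We show
  the two ways of approximation in Fig. 3.

[Fig. 3: the two ways of approximating the min function, (a) $(\lambda, \mu) = (0, C^{ub}_h)$ and (b) $(\lambda, \mu) = (1, 0)$.]

Theorem 2. The min abstract transformer preserves the following invariant:
$\Gamma(\widehat{A}^i_{j,2}) \subseteq [\hat{l}^i_{j,2}, \hat{u}^i_{j,2}]$.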
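
To make the three cases concrete, the following is a minimal sketch of the min transformer on a single neuron, assuming the incoming node is bounded by concrete values `l` and `u` and the clamp value is `c`. The function name `min_transformer` and the tie-breaking rule (prefer the constant upper bound when both areas are equal) are assumptions for illustration; the lower bound in the crossing case is the chord between (l, l) and (u, c).

```python
def min_transformer(l, u, c):
    """Linear bounds for y = min(x, c) with x in [l, u].

    Returns (lower, upper, new_l, new_u), where lower and upper are
    (slope, intercept) pairs such that
        lower[0]*x + lower[1] <= min(x, c) <= upper[0]*x + upper[1].
    """
    if l >= c:                      # always clamped: y = c
        return (0.0, c), (0.0, c), c, c
    if u <= c:                      # never clamped: y = x
        return (1.0, 0.0), (1.0, 0.0), l, u
    # Crossing case: chord from (l, l) to (u, c) is a sound lower bound.
    alpha = (c - l) / (u - l)
    beta = (u - c) / (u - l) * l
    # Area between each candidate upper line and the chord over [l, u].
    area_const = (c - l) * (u - l) / 2.0   # upper bound y <= c
    area_ident = (u - c) * (u - l) / 2.0   # upper bound y <= x
    lam, mu = (0.0, c) if area_const <= area_ident else (1.0, 0.0)
    return (alpha, beta), (lam, mu), l, lam * u + mu
```

With l = 0, u = 10 and c = 2 the constant bound gives the smaller area, while with c = 8 the identity bound does.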


From our abstract domain for QNNs, we get a symbolic interval analysis, 
similar to the one for DNNs using DEEPPOLY, to replace Line 2 in Algorithm 1. 


4.2 Symbolic Quantization Error Computation 


Recall that to compute tight bounds of QNNs or DNNs via symbolic interval 
analysis, variables in upper and lower polyhedral computations are recursively 
substituted with the corresponding upper/lower polyhedral computations of vari- 
ables until they only contain the input variables from which the concrete intervals 
are computed. This idea motivates us to design a symbolic difference computa- 
tion approach for differential reachability analysis based on the abstract domain 
DEEPPOLY for DNNs and our abstract domain for QNNs. 

Consider two hidden neurons $x^i_{j,s}$ and $\hat{x}^i_{j,p}$ from the DNN $N$ and the QNN
$\widehat{N}$. Let $A^{i,*}_{j,s} = \langle a^{i,\leq,*}_{j,s}, a^{i,\geq,*}_{j,s}, l^{i,*}_{j,s}, u^{i,*}_{j,s} \rangle$ and $\widehat{A}^{i,*}_{j,p} = \langle \hat{a}^{i,\leq,*}_{j,p}, \hat{a}^{i,\geq,*}_{j,p}, \hat{l}^{i,*}_{j,p}, \hat{u}^{i,*}_{j,p} \rangle$ be their
abstract elements, respectively, where all the polyhedral computations are linear
combinations of the input variables of the DNN and QNN, respectively, i.e.,

- $a^{i,\leq,*}_{j,s} = \sum_{k=1}^{m} w^{l,*}_k x^1_k + b^{l,*}_j$, $a^{i,\geq,*}_{j,s} = \sum_{k=1}^{m} w^{u,*}_k x^1_k + b^{u,*}_j$;
- $\hat{a}^{i,\leq,*}_{j,p} = \sum_{k=1}^{m} \hat{w}^{l,*}_k \hat{x}^1_k + \hat{b}^{l,*}_j$, $\hat{a}^{i,\geq,*}_{j,p} = \sum_{k=1}^{m} \hat{w}^{u,*}_k \hat{x}^1_k + \hat{b}^{u,*}_j$.

Then, the sound lower bound $\Delta l^{i,*}_{j,s}$ and upper bound $\Delta u^{i,*}_{j,s}$ of the difference can
be derived as follows, where $p = 2s$:

$\Delta l^{i,*}_{j,s} = LB(2^{-F_h} \hat{x}^i_{j,p} - x^i_{j,s}) = 2^{-F_h} \hat{a}^{i,\leq,*}_{j,p} - a^{i,\geq,*}_{j,s}$,

$\Delta u^{i,*}_{j,s} = UB(2^{-F_h} \hat{x}^i_{j,p} - x^i_{j,s}) = 2^{-F_h} \hat{a}^{i,\geq,*}_{j,p} - a^{i,\leq,*}_{j,s}$.

Table 1. Benchmarks for QNNs and DNNs on MNIST.

Arch            #Paras      QNNs                                  DNNs
                            Q=4      Q=6      Q=8      Q=10
P1: 1blk_100    ≈ 79.5k     96.38%   96.79%   96.77%   96.74%   96.92%
P2: 2blk_100    ≈ 89.6k     96.01%   97.04%   97.00%   97.02%   97.07%
P3: 3blk_100    ≈ 99.7k     95.53%   96.66%   96.59%   96.68%   96.71%
P4: 2blk_512    ≈ 669.7k    96.69%   97.41%   97.35%   97.36%   97.36%
P5: 4blk_1024   ≈ 3,963k    97.71%   98.05%   98.01%   98.04%   97.97%


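Given a quantized input $\hat{x}$ of the QNN $\widehat{N}$, the input difference of two networks
is $2^{-F_{in}} \hat{x} - x = (2^{-F_{in}} C^{ub}_{in} - 1) x$. Therefore, we have
$\Delta^1_k = \tilde{x}^1_k - x^1_k = 2^{-F_{in}} \hat{x}^1_k - x^1_k = (2^{-F_{in}} C^{ub}_{in} - 1) x^1_k$.
Then, the lower bound of the difference can be reformulated as follows, which only
contains the input variables of the DNN $N$:
$\Delta l^{i,*}_{j,s} = \Delta b^{l,*}_j + \sum_{k=1}^{m} (-w^{u,*}_k + 2^{-F_{in}} C^{ub}_{in} \tilde{w}^{l,*}_k) x^1_k$,
where $\Delta b^{l,*}_j = 2^{-F_h} \hat{b}^{l,*}_j - b^{u,*}_j$, $F^* = F_{in} - F_h$,
$\Delta^1_k = \tilde{x}^1_k - x^1_k$, and $\tilde{w}^{l,*}_k = 2^{F^*} \hat{w}^{l,*}_k$.

Similarly, we can reformulate the upper bound $\Delta u^{i,*}_{j,s}$ as follows using the
input variables of the DNN:
$\Delta u^{i,*}_{j,s} = \Delta b^{u,*}_j + \sum_{k=1}^{m} (-w^{l,*}_k + 2^{-F_{in}} C^{ub}_{in} \tilde{w}^{u,*}_k) x^1_k$,
where $\Delta b^{u,*}_j = 2^{-F_h} \hat{b}^{u,*}_j - b^{l,*}_j$, $F^* = F_{in} - F_h$, and $\tilde{w}^{u,*}_k = 2^{F^*} \hat{w}^{u,*}_k$.

Finally, we compute the concrete input difference interval $\delta^{in}_{i,j}$ based on the
given input region as $\delta^{in}_{i,j} = [LB(\Delta l^{i,*}_{j,0}), UB(\Delta u^{i,*}_{j,0})]$, with which we can replace
the AFFTRS functions in Algorithm 1 directly. An illustrating example is given
in [65].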
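
The benefit of substituting back to the input variables before concretizing can be seen on a small example. The sketch below is an illustration only, not the tool's implementation: it assumes each neuron comes with linear lower/upper bounds over the inputs, that the quantized input satisfies x_hat = c_in * x, and it folds the scaling into the single factor 2^(-F_h) * c_in (which equals the product of the two scaling factors in the reformulation above). The names `concretize` and `symbolic_difference` and all numbers are hypothetical.

```python
def concretize(coeffs, const, box, lower):
    """Minimum (lower=True) or maximum of a linear form over a box."""
    total = const
    for c, (lo, hi) in zip(coeffs, box):
        if lower:
            total += c * (lo if c >= 0 else hi)
        else:
            total += c * (hi if c >= 0 else lo)
    return total


def symbolic_difference(dnn, qnn, f_h, c_in, box):
    """Difference interval of 2^-F_h * x_hat_neuron - x_neuron.

    dnn = (w_l, b_l, w_u, b_u): linear bounds over the real inputs x.
    qnn = (w_l, b_l, w_u, b_u): linear bounds over x_hat = c_in * x.
    """
    w_l, b_l, w_u, b_u = dnn
    qw_l, qb_l, qw_u, qb_u = qnn
    scale = 2.0 ** (-f_h)
    # Lower bound: QNN lower polyhedral bound minus DNN upper polyhedral bound.
    low_coeffs = [scale * c_in * q - w for q, w in zip(qw_l, w_u)]
    low_const = scale * qb_l - b_u
    # Upper bound: QNN upper polyhedral bound minus DNN lower polyhedral bound.
    up_coeffs = [scale * c_in * q - w for q, w in zip(qw_u, w_l)]
    up_const = scale * qb_u - b_l
    return (concretize(low_coeffs, low_const, box, True),
            concretize(up_coeffs, up_const, box, False))
```

For example, with one input in [0, 1], `c_in = 4`, `f_h = 2`, DNN bounds x <= neuron <= x + 0.5 and QNN bounds x_hat - 2 <= neuron <= x_hat + 2, the input-dependent terms cancel and the result is (-1.0, 0.5), whereas subtracting the concrete intervals [-0.5, 1.5] and [0, 1.5] would only give [-2, 1.5].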


5 Evaluation 


We have implemented our method QEBVerif as an end-to-end tool written in 
Python, where we use Gurobi [20] as our back-end MILP solver. All floating- 
point numbers used in our tool are 32-bit. Experiments are conducted on a 
96-core machine with Intel(R) Xeon(R) Gold 6342 2.80 GHz CPU and 1 TB 
main memory. We allow Gurobi to use up to 24 threads. The time limit for each 
verification task is 1h. 


Benchmarks. We first build 45 × 4 QNNs from the 45 DNNs of ACAS Xu [26],
following a post-training quantization scheme [44] and using quantization con-
figurations C_in = ⟨±, 8, 8⟩, C_w = C_b = ⟨±, Q, Q − 2⟩, C_h = ⟨+, Q, Q − 2⟩, where
Q ∈ {4, 6, 8, 10}. We then train 5 DNNs with different architectures using the
MNIST dataset [31] and build 5 × 4 QNNs following the same quantization
scheme and quantization configurations except that we set C_in = ⟨+, 8, 8⟩ and
C_w = ⟨±, Q, Q − 1⟩ for each DNN trained on MNIST. Details on the networks
trained on the MNIST dataset are presented in Table 1. Column 1 gives the
name and architecture of each DNN, where Ablk_B means that the network
has A hidden layers with B neurons each; Column 2 gives the
number of parameters in each DNN; and Columns 3-7 list the accuracy of these
networks. Hereafter, we denote by Px-y (resp. Ax-y) the QNN using the archi-
tecture Px (resp. the x-th DNN) and quantization bit size Q = y for MNIST
(resp. ACAS Xu), and by Px-Full (resp. Ax-Full) the DNN of architecture Px
for MNIST (resp. the x-th DNN in ACAS Xu).


5.1 Effectiveness and Efficiency of DRA 


We first implement a naive method using existing state-of-the-art reachability
analysis methods for QNNs and DNNs. Specifically, we use the symbolic interval
analysis of DEEPPOLY [55] to compute the output intervals for a DNN, and
use the interval analysis of [22] to compute the output intervals for a QNN. Then,
we compute quantization error intervals via interval subtraction. Note that no
existing methods can directly verify quantization error bounds and the methods
in [48, 49] are not applicable. Finally, we compare the quantization error intervals
computed by the naive method against DRA in QEBVerif, using DNNs Ax-Full,
Py-Full and QNNs Ax-z, Py-z for x = 1, y ∈ {1, 2, 3, 4, 5} and z ∈ {4, 6, 8, 10}.
We use the same adversarial input regions (5 input points with radius r ∈
{3, 6, 13, 19, 26} for each point) as in [29] for ACAS Xu, and set the quantization
error bound ε ∈ {0.05, 0.1, 0.2, 0.3, 0.4}, resulting in 25 tasks for each radius.
For MNIST, we randomly select 30 input samples from the test set of MNIST
and set radius r = 3 for each input sample and quantization error bound ε ∈
{1, 2, 4, 6, 8}, resulting in a total of 150 tasks for each pair of DNN and QNN of
the same architecture for MNIST.
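
As a point of reference, the naive baseline amounts to plain interval arithmetic on the two output intervals. The following sketch assumes the QNN output interval is given on the integer grid and is scaled by 2^(-F_h) before subtraction; the function names and the numbers in the usage note are hypothetical.

```python
def naive_error_interval(qnn_out, dnn_out, f_h):
    """Interval of 2^-F_h * qnn_output - dnn_output by interval subtraction."""
    scale = 2.0 ** (-f_h)
    q_lo, q_hi = scale * qnn_out[0], scale * qnn_out[1]
    d_lo, d_hi = dnn_out
    # [q_lo, q_hi] - [d_lo, d_hi] = [q_lo - d_hi, q_hi - d_lo]
    return (q_lo - d_hi, q_hi - d_lo)


def proves_bound(interval, eps):
    """The error bound eps is proved if the whole interval lies in (-eps, eps)."""
    return max(abs(interval[0]), abs(interval[1])) < eps
```

For instance, a QNN output interval (8, 12) with `f_h = 2` and a DNN output interval (2.0, 2.5) yield the error interval (-0.5, 1.0), which proves the bound 2 but not the bound 1. Because the two intervals are treated as unrelated, any correlation between the networks is lost, which is what DRA is designed to recover.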

Table 2 reports the analysis results for ACAS Xu (above) and MNIST
(below). Column 2 lists different analysis methods, where QEBVerif (Int) is
Algorithm 1 and QEBVerif (Sym) uses a symbolic-based method for the affine
transformation in Algorithm 1 (cf. Sect. 4.2). Columns H_Diff (resp. O_Diff)
give the sum of the ranges of the difference intervals of all the hidden neu-
rons (resp. output neurons of the predicted class), averaged over the 25 verification
tasks for ACAS Xu and the 150 verification tasks for MNIST. Columns #S/T list the
number of tasks (#S) successfully proved by DRA and the average computation
time (T) in seconds, respectively, where the best ones (i.e., solving the most
tasks) are highlighted in blue. Note that Table 2 only reports the number of true
propositions proved by DRA, while the exact number of true propositions is unknown.

Unsurprisingly, QEBVerif (Sym) is less efficient than the others but is still
in the same order of magnitude. However, we can observe that QEBVerif (Sym)
solves the most tasks for both ACAS Xu and MNIST and produces the most
accurate difference intervals of both hidden neurons and output neurons for
almost all the tasks in MNIST, except for P1-8 and P1-10 where QEBVerif (Int)
performs better on the intervals for the output neurons. We also find that
QEBVerif (Sym) may perform worse than the naive method when the quantization
bit size is small for ACAS Xu. This is because: (1) the rounding error added into
the abstract domain of the affine function in each hidden layer of QNNs is large
due to the small bit size, and (2) such errors can accumulate and magnify layer
by layer, in contrast to the naive approach where we directly apply the interval
subtraction. We remark that symbolic-based reachability analysis methods for
DNNs become less accurate as the network gets deeper and the input region
gets larger. This means that for a large input region, the output intervals of hid-
den/output neurons computed by symbolic interval analysis for DNNs can be
very large. However, the output intervals of their quantized counterparts are
always limited by the quantization grid limit, i.e., [0, 2554]. Hence, the difference
intervals computed in Table 2 can be very conservative for large input regions
and deeper networks.

Table 2. Differential Reachability Analysis on ACAS Xu and MNIST.

ACAS Xu
                      r = 3                     | r = 6                     | r = 13                    | r = 19                    | r = 26
Q   Method            H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T
4   Naive             270.5   0.70    15/0.47   | 423.7   0.99    9/0.52    | 1,182   4.49    0/0.67    | 6,110   50.91   0/0.79    | 18,255  186.6   0/0.81
4   QEBVerif (Int)    270.5   0.70    15/0.49   | 423.4   0.99    9/0.53    | 1,181   4.46    0/0.70    | 6,044   50.91   0/0.81    | 17,696  186.6   0/0.85
4   QEBVerif (Sym)    749.4   145.7   0/2.02    | 780.9   150.2   0/2.11    | 1,347   210.4   0/2.24    | 6,176   254.7   0/2.35    | 18,283  343.7   0/2.39
6   Naive             268.3   1.43    5/0.47    | 557.2   4.00    0/0.51    | 1,258   6.91    0/0.67    | 6,145   53.29   0/0.77    | 18,299  189.0   0/0.82
6   QEBVerif (Int)    268.0   1.41    5/0.50    | 555.0   3.98    0/0.54    | 1,245   6.90    0/0.69    | 6,125   53.28   0/0.80    | 18,218  189.0   0/0.83
6   QEBVerif (Sym)    299.7   2.58    10/1.48   | 365.1   3.53    9/1.59    | 1,032   7.65    5/1.91    | 5,946   85.46   4/2.15    | 18,144  260.5   0/2.27
8   Naive             397.2   3.57    0/0.47    | 587.7   5.00    0/0.51    | 1,266   7.90    0/0.67    | 6,160   54.27   0/0.78    | 18,308  190.0   0/0.81
8   QEBVerif (Int)    388.4   3.56    0/0.49    | 560.1   5.00    0/0.53    | 1,222   7.89    0/0.69    | 6,103   54.27   0/0.79    | 18,212  190.0   0/0.83
8   QEBVerif (Sym)    35.75   0.01    24/1.10   | 93.78   0.16    18/1.19   | 845.2   5.84    8/1.65    | 5,832   58.73   5/1.97    | 18,033  209.6   5/2.12
10  Naive             394.5   3.67    0/0.49    | 591.4   5.17    0/0.51    | 1,268   8.04    0/0.68    | 6,164   54.42   0/0.78    | 18,312  190.1   0/0.80
10  QEBVerif (Int)    361.9   3.67    0/0.50    | 546.2   5.17    0/0.54    | 1,209   8.04    0/0.68    | 6,083   54.42   0/0.79    | 18,182  190.1   0/0.83
10  QEBVerif (Sym)    15.55   0.01    25/1.04   | 54.29   0.06    22/1.15   | 764.6   4.53    9/1.52    | 5,780   57.21   5/1.91    | 18,011  228.7   5/2.08

MNIST
                      P1                        | P2                        | P3                        | P4                        | P5
Q   Method            H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T      | H_Diff  O_Diff  #S/T
4   Naive             64.45   7.02    61/0.77   | 220.9   20.27   0/1.53    | 551.6   47.75   0/2.38    | 470.1   22.69   2/11.16   | 5,336   140.4   0/123.0
4   QEBVerif (Int)    32.86   6.65    63/0.78   | 194.8   20.27   0/1.54    | 530.9   47.75   0/2.40    | 443.3   22.69   2/11.23   | 5,275   140.4   0/123.4
4   QEBVerif (Sym)    32.69   3.14    88/1.31   | 134.9   7.11    49/2.91   | 313.8   14.90   1/5.08    | 365.2   11.11   35/22.28  | 1,864   50.30   1/310.2
6   Naive             68.94   7.89    66/0.77   | 249.5   24.25   0/1.52    | 616.2   54.66   0/2.38    | 612.2   31.67   1/11.18   | 7,399   221.0   0/125.4
6   QEBVerif (Int)    10.33   2.19    115/0.78  | 89.66   12.81   14/1.54   | 466.0   52.84   0/2.39    | 307.6   20.22   5/11.28   | 7,092   221.0   0/125.1
6   QEBVerif (Sym)    10.18   1.46    130/1.34  | 55.73   3.11    88/2.85   | 131.3   5.33    70/4.72   | 158.5   3.99    102/21.85 | 861.9   12.67   22/279.9
8   Naive             69.15   7.95    64/0.77   | 251.6   24.58   0/1.52    | 623.1   55.42   0/2.38    | 620.6   32.43   1/11.29   | 7,542   226.1   0/125.3
8   QEBVerif (Int)    4.27    0.89    135/0.78  | 38.87   5.99    66/1.54   | 320.1   40.84   0/2.39    | 134.0   8.99    50/11.24  | 7,109   226.1   0/125.7
8   QEBVerif (Sym)    4.13    1.02    136/1.35  | 34.01   2.14    108/2.82  | 82.90   3.48    86/4.61   | 96.26   2.39    128/21.45 | 675.7   6.20    27/273.6
10  Naive             69.18   7.96    65/0.77   | 252.0   24.63   0/1.52    | 624.0   55.55   0/2.36    | 620.4   32.40   1/11.19   | 7,559   226.9   0/124.2
10  QEBVerif (Int)    2.72    0.56    139/0.78  | 25.39   4.15    79/1.53   | 260.9   34.35   0/2.40    | 84.12   5.75    73/11.26  | 7,090   226.9   0/125.9
10  QEBVerif (Sym)    2.61    0.92    139/1.35  | 28.59   1.91    112/2.82  | 71.33   3.06    92/4.56   | 81.08   2.01    131/21.48 | 646.5   5.68    31/271.5


5.2 Effectiveness and Efficiency of QEBVerif 


We evaluate QEBVerif on QNNs Ax-z, Py-z for x = 1, y ∈ {1, 2, 3, 4} and
z ∈ {4, 6, 8, 10}, as well as the corresponding DNNs. We use the same input regions
and error bounds as in Sect. 5.1 except that we consider r ∈ {3, 6, 13} for each
input point for ACAS Xu. Note that we omit the other two radii for ACAS Xu
and use medium-sized QNNs for MNIST as our evaluation benchmarks in this
experiment for the sake of time and computing resources.

Figure 4 shows the verification results of QEBVerif within 1 h per task, which
gives the number of successfully verified tasks with three methods. Note that
only the number of successfully proved tasks is given in Fig. 4 for DRA due
to its incompleteness. The blue bars show the results using only the symbolic
differential reachability analysis, i.e., QEBVerif (Sym). The yellow bars give the
results by a full verification process in QEBVerif as shown in Fig. 2, i.e., we first
use DRA and then use MILP solving if DRA fails. The red bars are similar to
the yellow ones except that linear constraints of the difference intervals of hidden
neurons obtained from DRA are added into the MILP encoding.

Fig. 4. Verification Results of QEBVerif on ACAS Xu and MNIST.

Overall, although DRA successfully proved most of the tasks (60.19% with
DRA solely), our MILP-based verification method can help further verify many
tasks on which DRA fails, namely, 85.67% with DRA+MILP and 88.59% with
DRA+MILP+Diff. Interestingly, we find that the effect of the added linear
constraints of the difference intervals on the MILP solving efficiency varies
across tasks. Our conjecture is that there are some heuristics in the Gurobi
solving algorithm for which the additional constraints may not always be helpful.
However, those difference linear constraints allow the MILP-based verification
method to verify more tasks, i.e., 79 more tasks in total.


5.3 Correlation of Quantization Errors and Robustness 


We use QEBVerif to verify a set of properties Ψ = {P(N, N̂, x̂, r, ε)},
where N = P1-Full, N̂ ∈ {P1-4, P1-8}, x̂ ∈ X and X is the set of
the 30 samples from MNIST as above, r ∈ {3, 5, 7} and ε ∈ Ω =
{0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0}. We solve all the above tasks and process
all the results to obtain the tightest range of quantization error bounds [a, b]
for each input region such that a, b ∈ Ω. It allows us to obtain intervals that
are tighter than those obtained via DRA. Finally, we implemented a robustness
verifier for QNNs in a way similar to [40] to check the robustness of P1-4 and
P1-8 w.r.t. the input regions given in Ψ.
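
One plausible way to post-process the verification results into such a range is sketched below. This is an assumption about the bookkeeping, not a description of the authors' scripts: `results` is a hypothetical mapping from each candidate bound to True if the property was verified and False if it was falsified, and timeouts are not modelled.

```python
def tightest_range(results):
    """Return (a, b): the largest falsified bound and the smallest verified bound.

    The worst-case quantization error then lies between a and b.
    None is returned for a side on which no candidate bound gives information.
    """
    verified = [eps for eps, ok in results.items() if ok]
    falsified = [eps for eps, ok in results.items() if not ok]
    a = max(falsified) if falsified else None
    b = min(verified) if verified else None
    return a, b
```

For example, if the bounds 0.5 and 1.0 are falsified and 1.5 and 2.0 are verified for some input region, the sketch returns (1.0, 1.5).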

Figure 5 gives the experimental results. The blue (resp. yellow) bars in
Figs. 5(a) and 5(e) show the number of robust (resp. non-robust) samples among
the 30 verification tasks, and blue bars in the other 6 figures demonstrate the
quantization error interval for each input region. By comparing the results of
P1-8 and P1-4, we observe that P1-8 is more robust than P1-4 w.r.t. the 90 
input regions and its quantization errors are also generally much smaller than 
that of P1-4. Furthermore, we find that P1-8 remains consistently robust as the 
radius increases, and its quantization error interval changes very little. However, 
P1-4 becomes increasingly less robust as the radius increases and its quantiza- 
tion error also increases significantly. Thus, we speculate that there may be some 
correlation between network robustness and quantization error in QNNs. Specif- 
ically, as the quantization bit size decreases, the quantization error increases 
and the QNN becomes less robust. The reason we suspect “the fewer bits, the 
less robust” is that with fewer bits, a perturbation may easily cause significant 
change on hidden neurons (i.e., the change is magnified by the loss of precision) 
and consequently the output. Furthermore, the correlation between the quanti- 
zation error bound and the empirical robustness of the QNN suggests that it is 
indeed possible to apply our method to compute the quantization error bound 
and use it as a guide for identifying the best quantization scheme which balances 
the size of the model and its robustness. 


6 Related Work 


There is a large and growing body of work on quality assurance tech-
niques for neural networks, including testing (e.g., [4-7,47,50,56,57,63,69]) and
formal verification (e.g., [2,8,12,13,15,19,24,29,30,32,34,37,38,51,54,55,58-60,
62,70]). Testing techniques are often effective in finding violations, but they
cannot prove their absence. While formal verification can prove their absence,
existing methods typically target real-valued neural networks, i.e., DNNs, and
are not effective in verifying quantization error bounds [48]. In this section, we
mainly discuss the existing verification techniques for QNNs.

Early work on formal verification of QNNs typically focuses on 1-bit quan- 
tized neural networks (i.e., BNNs) [3,9,46,52,53,66,67]. Narodytska et al. [46] 
first proposed to reduce the verification problem of BNNs to a satisfiability prob- 
lem of a Boolean formula or an integer linear programming problem. Baluta 
et al. [3] proposed a PAC-style quantitative analysis framework for BNNs via 
approximate SAT model-counting solvers. Shih et al. proposed a quantitative 
verification framework for BNNs [52,53] via a BDD learning-based method [45]. 
Zhang et al. [66,67] proposed a BDD-based verification framework for BNNs, 
which exploits the internal structure of the BNNs to construct BDD models 
instead of BDD-learning. Giacobbe et al. [16] pushed this direction further by 
introducing the first formal verification for multiple-bit quantized DNNs (i.e., 
QNNs) by encoding the robustness verification problem into an SMT formula 
based on the first-order theory of quantifier-free bit-vector. Later, Henzinger et 
al. [22] explored several heuristics to improve the efficiency and scalability of [16]. 
Very recently, [40,68] proposed an ILP-based method and an MILP-based verifi- 
cation method for QNNs, respectively, and both outperform the SMT-based ver- 
ification approach [22]. Though these works can directly verify QNNs or BNNs, 
they cannot verify quantization error bounds. 

There are also some works focusing on exploring the properties of two neural 
networks which are most closely related to our work. Paulsen et al. [48,49] pro- 
posed differential verification methods to verify two DNNs with the same network 
topology. This idea has been extended to handle recurrent neural networks [41]. 
The difference between [41,48,49] and our work has been discussed throughout 
this work, i.e., they focus on quantized weights and cannot handle quantized 
activation tensors. Moreover, their methods are not complete, thus would fail to 
prove tighter error bounds. Semi-definite programming was used to analyze the 
different behaviors of DNNs and fully QNNs [33]. Different from our work focus- 
ing on verification, they aim at generating an upper bound for the worst-case 
error induced by quantization. Furthermore, [33] only scales to tiny QNNs, e.g., 1
input neuron, 1 output neuron, and 10 neurons per hidden layer (up to 4 hidden 
layers). In comparison, our differential reachability analysis scales to much larger 
QNNs, e.g., QNN with 4890 neurons. 


7 Conclusion 


In this work, we proposed a novel quantization error bound verification method 
QEBVerif which is sound, complete, and arguably efficient. We implemented it as 
an end-to-end tool and conducted thorough experiments on various QNNs with 
different quantization bit sizes. Experimental results showed the effectiveness and 
the efficiency of QEBVerif. We also investigated the potential correlation between 
robustness and quantization errors for QNNs and found that as the quantization 
error increases the QNN might become less robust. For further work, it would be 
interesting to investigate the verification method for other activation functions 
and network architectures, towards which this work makes a significant step.

Abstract. Deep neural networks (DNNs) are the workhorses of deep
learning, which constitutes the state of the art in numerous application 
domains. However, DNN-based decision rules are notoriously prone to 
poor generalization, i.e., may prove inadequate on inputs not encountered 
during training. This limitation poses a significant obstacle to employ- 
ing deep learning for mission-critical tasks, and also in real-world envi- 
ronments that exhibit high variability. We propose a novel, verification- 
driven methodology for identifying DNN-based decision rules that gener- 
alize well to new input domains. Our approach quantifies generalization 
to an input domain by the extent to which decisions reached by inde- 
pendently trained DNNs are in agreement for inputs in this domain. We 
show how, by harnessing the power of DNN verification, our approach 
can be efficiently and effectively realized. We evaluate our verification- 
based approach on three deep reinforcement learning (DRL) benchmarks, 
including a system for Internet congestion control. Our results establish 
the usefulness of our approach. More broadly, our work puts forth a 
novel objective for formal verification, with the potential for mitigating 
the risks associated with deploying DNN-based systems in the wild. 


1 Introduction 


Over the past decade, deep learning [35] has achieved state-of-the-art results 
in natural language processing, image recognition, game playing, computational 
biology, and many additional fields [4,18,21,45,50,84,85]. However, despite its 
impressive success, deep learning still suffers from severe drawbacks that limit 
its applicability in domains that involve mission-critical tasks or highly variable 
inputs. 

One such crucial limitation is the notorious difficulty of deep neural networks 
(DNNs) to generalize to new input domains, i.e., their tendency to perform 
poorly on inputs that significantly differ from those encountered while training. 
During training, a DNN is presented with input data sampled from a specific dis- 
tribution over some input domain (“in-distribution” inputs). The induced DNN- 
based rules may fail in generalizing to inputs not encountered during training 
due to (1) the DNN being invoked “out-of-distribution” (OOD), i.e., when there 
is a mismatch between the distribution over inputs in the training data and in
the DNN’s operational data; (2) some inputs not being sufficiently represented
in the finite training data (e.g., various low-probability corner cases); and (3) 
“overfitting” the decision rule to the training data. 

A notable example of the importance of establishing the generalizability of 
DNN-based decisions lies in recently proposed applications of deep reinforce- 
ment learning (DRL) [56] to real-world systems. Under DRL, an agent, realized 
as a DNN, is trained by repeatedly interacting with its environment to learn a 
decision-making policy that attains high performance with respect to a certain 
objective (“reward”). DRL has recently been applied to many real-world chal- 
lenges [20,44,54,55,64-67,96,108]. In many application domains, the learned 
policy is expected to perform well across a daunting breadth of operational 
environments, whose diversity cannot possibly be captured in the training data. 
Further, the cost of erroneous decisions can be dire. Our discussion of DRL-based 
Internet congestion control (see Sect. 4.3) illustrates this point. 

Here, we present a methodology for identifying DNN-based decision rules 
that generalize well to all possible distributions over an input domain of interest. 
Our approach hinges on the following key observation. DNN training in general, 
and DRL policy training in particular, incorporate multiple stochastic aspects, 
such as the initialization of the DNN’s weights and the order in which inputs 
are observed during training. Consequently, even when DNNs with the same 
architecture are trained to perform an identical task on the same data, somewhat 
different decision rules will typically be learned. Paraphrasing Tolstoy’s Anna 
Karenina [93], we argue that “successful decision rules are all alike; but every 
unsuccessful decision rule is unsuccessful in its own way”. Differently put, when 
examining the decisions by several independently trained DNNs on a certain 
input, these are likely to agree only when their (similar) decisions yield high 
performance. 

In light of the above, we propose the following heuristic for generating DNN- 
based decision rules that generalize well to an entire given domain of inputs: 
independently train multiple DNNs, and then seek a subset of these DNNs that 
are in strong agreement across all possible inputs in the considered input domain 
(implying, by our hypothesis, that these DNNs’ learned decision rules generalize 
well to all probability distributions over this domain). Our evaluation demon- 
strates (see Sect. 4) that this methodology is extremely powerful and enables 
distilling from a collection of decision rules the few that indeed generalize better 
to inputs within this domain. Since our heuristic seeks DNNs whose decisions 
are in agreement for each and every input in a specific domain, the decision rules 
reached this way achieve robustly high generalization across different possible 
distributions over inputs in this domain. 

Since our methodology involves contrasting the outputs of different DNNs 
over possibly infinite input domains, using formal verification is natural. To 
this end, we build on recent advances in formal verification of DNNs [2,12, 14, 
16,27,60,78,86, 102]. DNN verification literature has focused on establishing the 
local adversarial robustness of DNNs, i.e., seeking small input perturbations that 
result in misclassification by the DNN [31,36,61]. Our approach broadens the
applicability of DNN verification by demonstrating, for the first time (to the best
of our knowledge), how it can also be used to identify DNN-based decision rules 
that generalize well. More specifically, we show how, for a given input domain, 
a DNN verifier can be utilized to assign a score to a DNN reflecting its level 
of agreement with other DNNs across the entire input domain. This enables 
iteratively pruning the set of candidate DNNs, eventually keeping only those in 
strong agreement, which tend to generalize well. 

To evaluate our methodology, we focus on three popular DRL benchmarks: 
(i) Cartpole, which involves controlling a cart while balancing a pendulum; (ii) 
Mountain Car, which involves controlling a car that needs to escape a valley; 
and (iii) Aurora, an Internet congestion controller. 

Aurora is a particularly compelling example for our approach. While Aurora 
is intended to tame network congestion across a vast diversity of real-world 
Internet environments, Aurora is trained only on synthetically generated data. 
Thus, to deploy Aurora in the real world, it is critical to ensure that its policy 
is sound for numerous scenarios not captured by its training inputs. 

Our evaluation results show that, in all three settings, our verification-driven 
approach is successful at ranking DNN-based DRL policies according to their 
ability to generalize well to out-of-distribution inputs. Our experiments also 
demonstrate that formal verification is superior to gradient-based methods and 
predictive uncertainty methods. These results showcase the potential of our app- 
roach. Our code and benchmarks are publicly available as an artifact accompa- 
nying this work [8]. 

The rest of the paper is organized as follows. Section 2 contains background 
on DNNs, DRLs, and DNN verification. In Sect. 3 we present our verification- 
based methodology for identifying DNNs that successfully generalize to OOD 
inputs. We present our evaluation in Sect. 4. Related work is covered in Sect. 5, 
and we conclude in Sect. 6. 


2 Background 


Deep Neural Networks (DNNs) [35] are directed graphs that comprise several
layers. Upon receiving an assignment of values to the nodes of its first (input)
layer, the DNN propagates these values, layer by layer, until ultimately reaching
the assignment of the final (output) layer. Computing the value for each node is
performed according to the type of that node’s layer. For example, in weighted-
sum layers, the node’s value is an affine combination of the values of the nodes
in the preceding layer to which it is connected. In rectified linear unit (ReLU)
layers, each node y computes the value y = ReLU(x) = max(x, 0), where x is a
single node from the preceding layer. For additional details on DNNs and their
training see [35]. Figure 1 depicts a toy DNN. For input V1 = [1, 2]^T, the sec-
ond layer computes the (weighted sum) V2 = [10, −1]^T. The ReLU functions are
subsequently applied in the third layer, and the result is V3 = [10, 0]^T. Finally,
the network’s single output is V4 = [20].

[Fig. 1. A toy DNN, with four layers: Input, Weighted sum, ReLU, Output.]
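
The layer-by-layer evaluation can be reproduced in a few lines. The weights of the toy network are drawn in Fig. 1 and are not stated in the text, so the values `W2` and `W4` below are hypothetical, chosen only so that the intermediate assignments match those quoted above; biases are assumed to be zero.

```python
def weighted_sum(weights, biases, values):
    """Affine layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """ReLU layer applied node-wise."""
    return [max(v, 0.0) for v in values]


# Hypothetical weights (not taken from Fig. 1), zero biases.
W2, B2 = [[4.0, 3.0], [1.0, -1.0]], [0.0, 0.0]
W4, B4 = [[2.0, 1.0]], [0.0]


def forward(v1):
    v2 = weighted_sum(W2, B2, v1)   # weighted-sum layer
    v3 = relu(v2)                   # ReLU layer
    v4 = weighted_sum(W4, B4, v3)   # output layer
    return v2, v3, v4
```

Calling `forward([1.0, 2.0])` with these assumed weights gives the assignments [10, -1], [10, 0] and [20] for the second, third and output layers.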


Deep Reinforcement Learning (DRL) [56] is a machine learning paradigm,
in which a DRL agent, implemented as a DNN, interacts with an environment
across discrete time-steps $t \in \{0, 1, 2, \ldots\}$. At each time-step, the agent is presented
with the environment’s state $s_t \in S$, and selects an action $N(s_t) = a_t \in A$.
The environment then transitions to its next state $s_{t+1}$, and presents the agent
with the reward $r_t$ for its previous action. The agent is trained through repeated
interactions with its environment to maximize the expected cumulative discounted
reward $R_t = \mathbb{E}\left[\sum_t \gamma^t \cdot r_t\right]$ (where $\gamma \in [0, 1]$ is termed the discount factor)
[38, 82, 90, 91, 97, 107].
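As a small worked example of the discounted sum inside the expectation, the following Python sketch computes $\sum_t \gamma^t \cdot r_t$ for a single finite episode. The reward sequence and the discount factor are made-up values used only for illustration; estimating the expectation itself would require averaging over many episodes.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite episode (t starts at 0)."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total


# Illustrative values only: three steps with reward 1 and discount 0.5.
episode_rewards = [1.0, 1.0, 1.0]
print(discounted_return(episode_rewards, 0.5))  # 1 + 0.5 + 0.25 = 1.75
```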


DNN and DRL Verification. A sound DNN verifier [46] receives as input 
(i) a trained DNN N; (ii) a precondition P on the DNN’s inputs, limiting the 
possible assignments to a domain of interest; and (iii) a postcondition Q on 
the DNN’s outputs, limiting the possible outputs of the DNN. The verifier can 
reply in one of two ways: (i) SAT, with a concrete input $x'$ for which
$P(x') \wedge Q(N(x'))$ is satisfied; or (ii) UNSAT, indicating there does not exist such an $x'$.
Typically, $Q$ encodes the negation of $N$’s desirable behavior for inputs that
satisfy $P$. Thus, a SAT result indicates that the DNN errs, and that $x'$ triggers
a bug; whereas an UNSAT result indicates that the DNN performs as intended. 
An example of this process appears in Appendix B of our extended paper [7]. 
To date, a plethora of verification approaches have been proposed for general, 
feed-forward DNNs [3, 31, 41, 46, 61, 99], as well as DRL-based agents that operate
within reactive environments [5, 9, 15, 22, 28].
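To make the query convention concrete, the Python sketch below mimics the interface of a verifier on a toy problem. It is not a verifier: it only inspects a finite list of candidate inputs, so its UNSAT answer is not a proof, whereas a sound verifier covers every input satisfying $P$. The network, the candidate list, and all function names are hypothetical.

```python
def toy_network(x):
    # Hypothetical one-input, one-output network: y = ReLU(2x - 1).
    return max(2.0 * x - 1.0, 0.0)


def check_query(network, candidates, precondition, postcondition):
    """Return ("SAT", x) for the first candidate x with P(x) and Q(N(x)),
    or ("UNSAT", None) if no candidate qualifies.

    Only the listed candidates are examined, so "UNSAT" here is NOT a proof.
    """
    for x in candidates:
        if precondition(x) and postcondition(network(x)):
            return "SAT", x
    return "UNSAT", None


candidates = [i / 10.0 for i in range(-20, 21)]   # finite sample of [-2, 2]

# Desired behavior: the output never exceeds 1.
# Q encodes the NEGATION of that behavior: output > 1.
def violates(y):
    return y > 1.0

narrow = check_query(toy_network, candidates, lambda x: 0.0 <= x <= 1.0, violates)
wide = check_query(toy_network, candidates, lambda x: 0.0 <= x <= 2.0, violates)
print(narrow)  # ('UNSAT', None)
print(wide)    # ('SAT', 1.1)
```

On the narrow domain no candidate violates the property, while on the wider domain the returned input plays the role of the counterexample $x'$.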


3 Quantifying Generalizability via Verification 


Our approach for assessing how well a DNN is expected to generalize on out-of- 
distribution inputs relies on the “Karenina hypothesis”: while there are many 
(possibly infinitely many) ways to produce incorrect results, correct outputs are likely
to be fairly similar. Hence, to identify DNN-based decision rules that generalize 
well to new input domains, we advocate training multiple DNNs and scoring the 
learned decision models according to how well their outputs are aligned with 
those of the other models for the considered input domain. These scores can be 
computed using a backend DNN verifier. We show how, by iteratively filtering 
out models that tend to disagree with the rest, DNNs that generalize well can 
be effectively distilled. 

We begin by introducing the following definitions for reasoning about the 
extent to which two DNN-based decision rules are in agreement over an input 
domain. 


Definition 1 (Distance Function). Let $O$ be the space of possible outputs for
a DNN. A distance function for $O$ is a function $d : O \times O \to \mathbb{R}^+$.

Intuitively, a distance function (e.g., the $L_1$ norm) allows us to quantify the
level of (dis)agreement between the decisions of two DNNs on the same input. 
We elaborate on some choices of distance functions that may be appropriate in 
various domains in Appendix B of our extended paper [7]. 


Definition 2 (Pairwise Disagreement Threshold). Let $N_1, N_2$ be DNNs
with the same output space $O$, let $d$ be a distance function, and let $\Psi$ be an input
domain. We define the pairwise disagreement threshold (PDT) of $N_1$ and $N_2$
as:

$$\alpha = PDT_{d,\Psi}(N_1, N_2) \triangleq \min \left\{ \alpha' \in \mathbb{R}^+ \mid \forall x \in \Psi : d(N_1(x), N_2(x)) \leq \alpha' \right\}$$

The definition captures the notion that for any input in $\Psi$, $N_1$ and $N_2$ produce
outputs that are at most $\alpha$-distance apart. A small $\alpha$ value indicates that
the outputs of $N_1$ and $N_2$ are close for all inputs in $\Psi$, whereas a high value
indicates that there exists an input in $\Psi$ for which the decision models diverge
significantly.

To compute PDT values, our approach employs verification to conduct a 
binary search for the maximum distance between the outputs of two DNNs; see 
Algorithm 1. 


Algorithm 1. Pairwise Disagreement Threshold

Input: DNNs $(N_i, N_j)$, distance func. $d$, input domain $\Psi$, max. disagreement $M > 0$
Output: $PDT(N_i, N_j)$
$low \leftarrow 0$, $high \leftarrow M$
while ($low < high$) do
    $\alpha \leftarrow \frac{1}{2} \cdot (low + high)$
    query $\leftarrow$ SMT SOLVER ($P \leftarrow \Psi$, $[N_i; N_j]$, $Q \leftarrow d(N_i, N_j) \geq \alpha$)
    if query is SAT then: $low \leftarrow \alpha$
    else if query is UNSAT then: $high \leftarrow \alpha$
end while
return $\alpha$
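The binary search of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the artifact’s implementation: the verifier call is replaced by a hypothetical callback `query_is_sat`, the loop stops once the interval is narrower than a hypothetical tolerance `epsilon`, and the stand-in oracle only checks a finite grid of inputs for two made-up one-dimensional models.

```python
def pdt_binary_search(query_is_sat, max_disagreement, epsilon=1e-3):
    """Approximate the PDT up to `epsilon` by bisection.

    `query_is_sat(alpha)` stands in for the verifier call: it should return
    True iff some input in the domain makes the two models differ by at
    least `alpha`.
    """
    low, high = 0.0, max_disagreement
    while high - low > epsilon:
        alpha = 0.5 * (low + high)
        if query_is_sat(alpha):
            low = alpha    # a disagreement of at least alpha exists
        else:
            high = alpha   # no input disagrees by alpha or more
    return high


# Toy stand-in for the verifier (assumption): two hypothetical 1-D models
# compared on a finite grid. A real DNN verifier reasons about the whole
# continuous domain rather than about samples.
def model_a(x):
    return 2.0 * x


def model_b(x):
    return 2.0 * x + 0.5 * abs(x)


grid = [i / 100.0 for i in range(-100, 101)]   # samples of the domain [-1, 1]


def grid_query(alpha):
    return any(abs(model_a(x) - model_b(x)) >= alpha for x in grid)


pdt = pdt_binary_search(grid_query, max_disagreement=10.0)
print(round(pdt, 2))  # 0.5
```

Returning the upper end of the final interval keeps the result an over-approximation of the largest disagreement found, which is the conservative choice when the value is later used for filtering.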


Pairwise disagreement thresholds can be aggregated to measure the disagree- 
ment between a decision model and a set of other decision models, as defined 
next. 


Definition 3 (Disagreement Score). Let $\mathcal{N} = \{N_1, N_2, \ldots, N_k\}$ be a set of
$k$ DNN-induced decision models, let $d$ be a distance function, and let $\Psi$ be an
input domain. A model’s disagreement score (DS) with respect to $\mathcal{N}$ is defined
as:

$$DS_{\mathcal{N},d,\Psi}(N_i) = \frac{1}{|\mathcal{N}| - 1} \sum_{j \in [k],\, j \neq i} PDT_{d,\Psi}(N_i, N_j)$$

Intuitively, the disagreement score measures how much a single decision model
tends to disagree with the remaining models, on average. 
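As a worked example of Definition 3, the Python sketch below averages precomputed PDT values for each model. The model names and the PDT table are made up for illustration; in the approach described here the table entries would come from the verifier-driven search of Algorithm 1.

```python
def disagreement_score(model, models, pdt):
    """Average PDT between `model` and every other model in `models`."""
    others = [m for m in models if m != model]
    return sum(pdt[frozenset((model, other))] for other in others) / len(others)


# Hypothetical PDT table (assumption), keyed by unordered model pairs.
models = ["N1", "N2", "N3"]
pdt = {
    frozenset(("N1", "N2")): 1.0,
    frozenset(("N1", "N3")): 4.0,
    frozenset(("N2", "N3")): 5.0,
}

scores = {m: disagreement_score(m, models, pdt) for m in models}
print(scores)  # {'N1': 2.5, 'N2': 3.0, 'N3': 4.5}
```

Here N3 has the highest score, i.e., it is the model that tends to disagree most with the others.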

Using disagreement scores, our heuristic employs an iterative scheme for
selecting a subset of models that generalize to OOD scenarios, as encoded by
inputs in $\Psi$ (see Algorithm 2). First, a set of $k$ DNNs $\{N_1, N_2, \ldots, N_k\}$ is
independently trained on the training data. Next, a backend verifier is invoked to
calculate, for each of the $\binom{k}{2}$ DNN-based model pairs, their respective
pairwise disagreement threshold (up to some $\epsilon$ accuracy). Next, our algorithm iteratively:
(i) calculates the disagreement score for each model in the remaining subset of
models; (ii) identifies the models with the (relative) highest DS scores; and (iii)
removes them (Line 9 in Algorithm 2). The algorithm terminates after exceeding
a user-defined number of iterations (Line 3 in Algorithm 2), or when the
remaining models “agree” across the input domain, as indicated by nearly identical
disagreement scores (Line 7 in Algorithm 2). We note that the algorithm
is also given an upper bound ($M$) on the maximum difference, informed by the
user’s domain-specific knowledge.


Algorithm 2. Model Selection

Input: Set of models $\mathcal{N} = \{N_1, \ldots, N_k\}$, max disagreement $M$, number of ITERATIONS
Output: $\mathcal{N}' \subseteq \mathcal{N}$
1: PDT $\leftarrow$ PAIRWISE DISAGREEMENT THRESHOLDS($\mathcal{N}, d, \Psi, M$)    ▷ table with all PDTs
2: $\mathcal{N}' \leftarrow \mathcal{N}$
3: for $l = 1 \ldots$ ITERATIONS do
4:     for $N_i \in \mathcal{N}'$ do
5:         currentDS[$N_i$] $\leftarrow DS_{\mathcal{N}'}(N_i, \text{PDT})$    ▷ based on Definition 3
6:     end for
7:     if modelScoresAreSimilar(currentDS) then: break
8:     modelsToRemove $\leftarrow$ findModelsWithHighestDS(currentDS)
9:     $\mathcal{N}' \leftarrow \mathcal{N}' \setminus$ modelsToRemove    ▷ remove models that tend to disagree
10: end for
11: return $\mathcal{N}'$


DS Removal Threshold. Different criteria are possible for determining the DS
threshold above which models are removed, and how many models to remove
in each iteration (Line 8 in Algorithm 2). A natural and simple approach, used
in our evaluation, is to remove the p% of models with the highest disagreement
scores, for some choice of p (25% in our evaluation). Due to space constraints, a 
thorough discussion of additional filtering criteria (all of which proved successful) 
is relegated to Appendix C of our extended paper [7]. 
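The iterative filtering of Algorithm 2, combined with the percentage-based removal rule, can be sketched in Python as follows. This is an illustration under assumptions rather than the artifact’s code: the similarity test (difference between the highest and lowest score at most a hypothetical `tolerance`) and the rounding rule (remove at least one model per iteration) are choices made for this sketch, and the PDT table is made up.

```python
import math


def disagreement_scores(models, pdt):
    """Disagreement score of each model relative to the current subset."""
    return {
        m: sum(pdt[frozenset((m, o))] for o in models if o != m) / (len(models) - 1)
        for m in models
    }


def select_models(models, pdt, iterations, remove_fraction=0.25, tolerance=1e-9):
    """Iteratively drop the models with the highest disagreement scores."""
    surviving = list(models)
    for _ in range(iterations):
        if len(surviving) < 2:
            break
        scores = disagreement_scores(surviving, pdt)
        # Assumed similarity criterion: all scores within `tolerance`.
        if max(scores.values()) - min(scores.values()) <= tolerance:
            break
        # Assumed rounding rule: remove a fraction, but at least one model.
        n_remove = max(1, math.floor(remove_fraction * len(surviving)))
        ranked = sorted(surviving, key=lambda m: scores[m], reverse=True)
        to_remove = set(ranked[:n_remove])
        surviving = [m for m in surviving if m not in to_remove]
    return surviving


# Hypothetical PDT table: A, B and C agree closely, D disagrees with all.
models = ["A", "B", "C", "D"]
pdt = {frozenset(pair): 0.1 for pair in [("A", "B"), ("A", "C"), ("B", "C")]}
pdt.update({frozenset(("D", m)): 5.0 for m in "ABC"})

selected = select_models(models, pdt, iterations=5)
print(selected)  # ['A', 'B', 'C']
```

Because the PDT table is computed once up front, each iteration only re-averages existing entries over the surviving subset, so no further verifier calls are needed during filtering.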


4 Evaluation 


We extensively evaluated our method using three DRL benchmarks. As discussed 
in the introduction, verifying the generalizability of DRL-based systems is impor- 
tant since such systems are often expected to provide robustly high performance 
across a broad range of environments, whose diversity is not captured by the
training data. Our evaluation spans two classic DRL settings, Cartpole [17] and 
Mountain Car [68], as well as the recently proposed Aurora congestion controller 
for Internet traffic [44]. Aurora is a particularly compelling example for a fairly 
complex DRL-based system that addresses a crucial real-world challenge and 
must generalize to real-world conditions not represented in its training data. 


Setup. For each of the three DRL benchmarks, we first trained multiple DNNs 
with the same architecture, where the training process differed only in the ran- 
dom seed used. We then removed from this set of DNNs all but the ones that 
achieved high reward values in-distribution (to eliminate the possibility that a 
decision model generalizes poorly simply due to poor training). Next, we defined 
out-of-distribution input domains of interest for each specific benchmark, and 
used Algorithm 2 to select the models most likely to generalize well on those 
domains according to our framework. To establish the ground truth for how 
well different models actually generalize in practice, we then applied the models 
to OOD inputs drawn from the considered domain and ranked them based on 
their empirical performance (average reward). To investigate the robustness of 
our results, the last step was conducted for varying choices of probability dis- 
tributions over the inputs in the domain. All DNNs used have a feed-forward 
architecture comprised of two hidden layers of ReLU activations, and include 
32-64 neurons in the first hidden layer, and 16 neurons in the second hidden 
layer. 

The results indicate that models selected by our approach are likely to per- 
form significantly better than the rest. Below we describe the gist of our evalua- 
tion; extensive additional information is available in [7]. 


4.1 Cartpole 


Cartpole [33] is a well-known RL benchmark in which an agent controls the
movement of a cart with an upside-down pendulum (“pole”) attached to its top.
The cart moves on a platform and the agent’s goal is to keep the pole balanced
for as long as possible (see Fig. 2).

Fig. 2. Cartpole: in-distribution setting (blue) and OOD setting (red). (Color figure
online)


Agent and Environment. The agent’s inputs are $s = (x, v_x, \theta, v_\theta)$, where $x$
represents the cart’s location on the platform, $\theta$ represents the pole’s angle (i.e.,
$|\theta| \approx 0$ for a balanced pole, $|\theta| \approx 90^{\circ}$ for an unbalanced pole), $v_x$ represents the
cart’s horizontal velocity and $v_\theta$ represents the pole’s angular velocity.


In-Distribution Inputs. During training, the agent is incentivized to balance 
the pole, while staying within the platform’s boundaries. In each iteration, the 
agent’s single output indicates the cart’s acceleration (sign and magnitude) for 
the next step. During training, we defined the platform’s bounds to be [-2.4, 2.4],
and the cart’s initial position as near-static, and close to the center of the
platform (left-hand side of Fig. 2). This was achieved by drawing the cart’s initial
state vector values uniformly from the range [-0.05, 0.05).


(OOD) Input Domain. We consider an input domain with larger platforms 
than the ones used in training. To wit, we now allow the x coordinate of the 
input vectors to cover a wider range of [-10, 10]. For the other inputs, we used
the same bounds as during the training. See [7] for additional details. 


Evaluation. We trained k = 16 models, all of which achieved high rewards
during training on the short platform. Next, we ran Algorithm 2 until convergence
(7 iterations, in our experiments) on the aforementioned input domain, resulting
in a set of 3 models. We then tested all 16 original models using (OOD) inputs
drawn from the new domain, such that the generated distribution encodes a novel
setting: the cart is now placed at the center of a much longer, shifted platform (see
the red cart in Fig. 2).

Fig. 3. Cartpole: Algorithm 2’s results, per iteration: the bars reflect the ratio
between the good/bad models (left y-axis) in the surviving set of models, and the
curve indicates the number of surviving models (right y-axis).

All other parameters in the OOD environment were identical to those used 
for the original training. Figure 9 (in [7]) depicts the results of evaluating the
models using 20,000 OOD instances. Of the original 16 models, 11 scored a low- 
to-mediocre average reward, indicating their poor ability to generalize to this 
new distribution. Only 5 models obtained high reward values, including the 3 
models identified by Algorithm 2; thus implying that our method was able to 
effectively remove all 11 models that would have otherwise performed poorly in 
this OOD setting (see Fig. 3). For additional information, see [7]. 


[Fig. 3 plot. x-axis: iteration; left y-axis: set composition by model type (good models, bad models), in percent; right y-axis: total number of models.]


4.2 Mountain Car 


For our second experiment, we evaluated our method on the Mountain Car [79] 
benchmark, in which an agent controls a car that needs to learn how to escape 
a valley and reach a target. As in the Cartpole experiment, we selected a set of 
models that performed well in-distribution and applied our method to identify 
a subset of models that make similar decisions in a predefined input domain. 
We again generated OOD inputs (relative to the training) from within this 
domain, and observed that the models selected by our algorithm indeed general- 
ize significantly better than their peers that were iteratively removed. Detailed 
information about this benchmark can be found in Appendix E of our extended
paper [7]. 


4.3 Aurora Congestion Controller 


In our third benchmark, we applied our method to a complex, real-world system 
that implements a policy for Internet congestion control. The goal of congestion 
control is to determine, for each traffic source in a communication network, the 
pace at which data packets should be sent into the network. Congestion control is 
a notoriously difficult and fundamental challenge in computer networking [59, 69]; 
sending packets too fast might cause network congestion, leading to data loss 
and delays. Conversely, low sending rates might under-utilize available network 
bandwidth. Aurora [44] is a DRL-based congestion controller that is the subject 
of recent work on DRL verification [9,28]. In each time-step, an Aurora agent 
observes statistics regarding the network and decides the packet sending rate 
for the following time-step. For example, if the agent observes excellent network 
conditions (e.g., no packet loss), we expect it to increase the packet sending rate 
to better utilize the network. We note that Aurora handles a much harder task 
than classical RL benchmarks (e.g., Cartpole and Mountain Car): congestion 
controllers must react gracefully to various possible events based on nuanced 
signals, as reflected by Aurora’s inputs. Here, unlike in the previous benchmarks, 
it is not straightforward to characterize the optimal policy. 


Agent and Environment. Aurora’s inputs are $t$ vectors $v_1, \ldots, v_t$, representing
observations from the $t$ previous time-steps. The agent’s single output value
indicates the change in the packet sending rate over the next time-step. Each
vector $v_i \in \mathbb{R}^3$ includes three distinct values, representing statistics that reflect
the network’s condition (see details in Appendix F of [7]). In line with previous
work [9, 28, 44], we set $t = 10$ time-steps, making Aurora’s inputs of size $3t = 30$.
The reward function is a linear combination of the data sender’s throughput, 
latency, and packet loss, as observed by the agent (see [44] for additional details). 


In-Distribution Inputs. Aurora’s training applies the congestion controller 
to simple network scenarios where a single sender sends traffic towards a single 
receiver across a single network link. Aurora is trained across varying choices of 
initial sending rate, link bandwidth, link packet-loss rate, link latency, and size 
of the link’s packet buffer. During training, packets are initially sent by Aurora 
at a rate corresponding to 0.3 to 1.5 times the link’s bandwidth.


(OOD) Input Domain. In our experiments, the input domain encoded a link 
with a shallow packet buffer, implying that only a few packets can accumulate in 
the network (while most excess traffic is discarded), causing the link to exhibit a 
volatile behavior. This is captured by the initial sending rate being up to 8 times 
the link’s bandwidth, to model the possibility of a dramatic decrease in available 
bandwidth (e.g., due to competition, traffic shifts, etc.). See [7] for additional 
details. 


Evaluation. We ran our algorithm and scored the models based on their
disagreement upon this large domain, which includes inputs they had not
encountered during training, representing the aforementioned novel link conditions.


Experiment (1): High Packet Loss. In this experiment, we trained over 100 
Aurora agents in the original (in-distribution) environment. Out of these, we 
selected k = 16 agents that achieved a high average reward in-distribution (see 
Fig. 20a in [7]). Next, we evaluated these agents on OOD inputs that are included 
in the previously described domain. The main difference between the training 
distribution and the new (OOD) ones is the possibility of extreme packet loss 
rates upon initialization. 

Our evaluation over the OOD inputs, within the domain, indicates that 
although all 16 models performed well in-distribution, only 7 agents could suc- 
cessfully handle such OOD inputs (see Fig. 20b in [7]). When we ran Algorithm 2 
on the 16 models, it was able to filter out all 9 models that generalized poorly 
on the OOD inputs (see Fig. 4). In particular, our method returned model {16}, 
which is the best-performing model according to our simulations. We note that 
in the first iterations, the four models to be filtered out were models {1, 2, 6, 13}, 
which are indeed the four worst-performing models on the OOD inputs (see 
Appendix F of [7]). 

Fig. 4. Aurora: Algorithm 2’s results, per iteration.


Experiment (2): Additional Distributions over OOD Inputs. To further 
demonstrate that, in the specified input domain, our method is indeed likely to 
keep better-performing models while removing bad models, we reran the previous 
Aurora experiments for additional distributions (probability density functions) 
over the OOD inputs. Our evaluation reveals that all models removed by Algo- 
rithm 2 achieved low reward values also for these additional distributions. These 
results highlight an important advantage of our approach: it applies to all inputs 
within the considered domain, and so it applies to all distributions over these 
inputs. 


Additional Experiments. We also generated a new set of Aurora models 
by altering the training process to include significantly longer interactions. We 
then repeated the aforementioned experiments. The results (summarized in [7])
demonstrate that our approach (again) successfully selected a subset of models 
that generalizes well to distributions over the OOD input domain. 


4.4 Comparison to Additional Methods 


Gradient-based methods [40,53,62,63] are optimization algorithms capable of 
finding DNN inputs that satisfy prescribed constraints, similarly to verification 
methods. These algorithms are extremely popular due to their simplicity and 
scalability. However, this comes at the cost of being inherently incomplete and 
not as precise as DNN verification [11,101]. Indeed, when modifying our algo- 
rithm to calculate PDT scores with gradient-based methods, the results (sum- 
marized in Appendix G of [7]) reveal that, in our context, the verification-based 
approach is superior to the gradient-based ones. Due to the incompleteness of 
gradient-based approaches [101], they often computed sub-optimal PDT values, 
resulting in models that generalize poorly being retained. 


Predictive uncertainty methods [1,74] are online methods for assessing uncer- 
tainty with respect to observed inputs, to determine whether an encountered 
input is drawn from the training distribution. We ran an experiment com- 
paring our approach to uncertainty-prediction-based model selection: we gen- 
erated ensembles [23,30,51] of our original models, and used a variance-based 
metric (motivated by [58]) to identify subsets of models with low output vari- 
ance on OOD-sampled inputs. Similar to gradient-based methods, predictive- 
uncertainty techniques proved fast and scalable, but lacked the precision afforded 
by verification-driven model selection and were unable to discard poorly general- 
izing models. For example, when ranking Cartpole models by their uncertainty 
on OOD inputs, the three models with the lowest uncertainty included also 
“bad” models, which had been filtered out by our approach. 


5 Related Work 


Recently, a plethora of approaches and tools have been put forth for ensuring
DNN correctness [2, 6, 10, 15, 19, 24-27, 29, 31, 32, 34, 36, 37, 41-43, 46-49, 52,
57, 61, 70, 76, 81, 83, 86, 87, 89, 92, 94, 95, 98, 100, 102, 104, 106], including techniques
for DNN shielding [60], optimization [14, 88], quantitative verification [16],
abstraction [12, 13, 73, 78, 86, 105], size reduction [77], and more. Non-verification
techniques, including runtime-monitoring [39], ensembles [71, 72, 80, 103] and
additional methods [75] have been utilized for OOD input detection.

In contrast to the above approaches, we aim to establish generalization guar- 
antees with respect to an entire input domain (spanning all distributions across 
this domain). In addition, to the best of our knowledge, ours is the first attempt 
to exploit variability across models for distilling a subset thereof, with improved 
generalization capabilities. In particular, it is also the first approach to apply 
formal verification for this purpose. 


6 Conclusion


This work describes a novel, verification-driven approach for identifying DNN 
models that generalize well to an input domain of interest. We presented an 
iterative scheme that employs a backend DNN verifier, allowing us to score 
models based on their ability to produce similar outputs on the given domain. 
We demonstrated extensively that this approach indeed distills models capable 
of good generalization. As DNN verification technology matures, our approach 
will become increasingly scalable, and also applicable to a wider variety of DNNs. 


Acknowledgements. The work of Amir, Zelazny, and Katz was partially supported 
by the Israel Science Foundation (grant number 683/18). The work of Amir was sup- 
ported by a scholarship from the Clore Israel Foundation. The work of Maayan and 
Schapira was partially supported by funding from Huawei. 


References 


1. Abdar, M., et al.: A review of uncertainty quantification in deep learning: tech- 
niques, applications and challenges. Inf. Fusion 76, 243-297 (2021) 

2. Alamdari, P., Avni, G., Henzinger, T., Lukina, A.: Formal methods with a touch 
of magic. In: Proceedings 20th International Conference on Formal Methods in 
Computer-Aided Design (FMCAD), pp. 138-147 (2020) 

3. Albarghouthi, A.: Introduction to Neural Network Verification (2021).
verifieddeeplearning.com

4. AlQuraishi, M.: AlphaFold at CASP13. Bioinformatics 35(22), 4862-4865 (2019)

5. Amir, G., et al.: Verifying learning-based robotic navigation systems. In: Sankara- 
narayanan, S., Sharygina, N. (eds.) Proceedings 29th International Conference on 
Tools and Algorithms for the Construction and Analysis of Systems (TACAS), 
pp. 607-627. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-30823-9_31

6. Amir, G., Freund, Z., Katz, G., Mandelbaum, E., Refaeli, I.: veriFIRE: verify- 
ing an industrial, learning-based wildfire detection system. In: Proceedings 25th 
International Symposium on Formal Methods (FM), pp. 648-656. Springer, Cham 
(2023). https://doi.org/10.1007/978-3-031-27481-7_38

7. Amir, G., Maayan, O., Zelazny, T., Katz, G., Schapira, M.: Verifying generalization
in deep learning. Technical report (2023). https://arxiv.org/abs/2302.05745

8. Amir, G., Maayan, O., Zelazny, T., Katz, G., Schapira, M.: Verifying generalization
in deep learning: artifact (2023). https://zenodo.org/record/7884514#.ZFAz_3ZBy3B

9. Amir, G., Schapira, M., Katz, G.: Towards scalable verification of deep reinforcement
learning. In: Proceedings 21st International Conference on Formal Methods
in Computer-Aided Design (FMCAD), pp. 193-203 (2021)

10. Amir, G., Wu, H., Barrett, C., Katz, G.: An SMT-based approach for verify- 
ing binarized neural networks. In: TACAS 2021. LNCS, vol. 12652, pp. 203-222. 
Springer, Cham (2021). https://doi.org/10.1007/978-3-030-72013-1_11 

11. Amir, G., Zelazny, T., Katz, G., Schapira, M.: Verification-aided deep ensemble 
selection. In: Proceedings 22nd International Conference on Formal Methods in 
Computer-Aided Design (FMCAD), pp. 27-37 (2022) 


12. Anderson, G., Pailoor, S., Dillig, I., Chaudhuri, S.: Optimization and abstraction:
a synergistic approach for analyzing neural network robustness. In: Proceedings
40th ACM SIGPLAN Conference on Programming Languages Design and
Implementations (PLDI), pp. 731-744 (2019)

13. Ashok, P., Hashemi, V., Kretinsky, J., Mohr, S.: DeepAbstract: neural network
abstraction for accelerating verification. In: Proceedings 18th International
Symposium on Automated Technology for Verification and Analysis (ATVA), pp.
92-107 (2020)

14. Avni, G., Bloem, R., Chatterjee, K., Henzinger, T.A., Konighofer, B., Pranger,
S.: Run-time optimization for learned controllers through quantitative games. In:
Dillig, I., Tasiran, S. (eds.) CAV 2019. LNCS, vol. 11561, pp. 630-649. Springer,
Cham (2019). https://doi.org/10.1007/978-3-030-25540-4_36

15. Bacci, E., Giacobbe, M., Parker, D.: Verifying reinforcement learning up to
infinity. In: Proceedings 30th International Joint Conference on Artificial Intelligence
(IJCAI) (2021)

16. Baluta, T., Shen, S., Shinde, S., Meel, K., Saxena, P.: Quantitative verification
of neural networks and its security applications. In: Proceedings ACM SIGSAC
Conference on Computer and Communications Security (CCS), pp. 1249-1264
(2019)

17. Barto, A., Sutton, R., Anderson, C.: Neuronlike adaptive elements that can solve
difficult learning control problems. In: Proceedings of IEEE Systems Man and
Cybernetics Conference (SMC), pp. 834-846 (1983)

18. Bojarski, M., et al.: End to end learning for self-driving cars. Technical report
(2016). http://arxiv.org/abs/1604.07316

19. Bunel, R., Turkaslan, I., Torr, P., Kohli, P., Mudigonda, P.: A unified view of
piecewise linear neural network verification. In: Proceedings 32nd Conference on
Neural Information Processing Systems (NeurIPS), pp. 4795-4804 (2018)

20. Chen, W., Xu, Y., Wu, X.: Deep reinforcement learning for multi-resource
multi-machine job scheduling. Technical report (2017).
http://arxiv.org/abs/1711.07440

21. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
Natural language processing (almost) from scratch. J. Mach. Learn. Res. (JMLR)
12, 2493-2537 (2011)

22. Corsi, D., Marchesini, E., Farinelli, A.: Formal verification of neural networks for
safety-critical tasks in deep reinforcement learning. In: Proceedings 37th Conference
on Uncertainty in Artificial Intelligence (UAI), pp. 333-343 (2021)

23. Dietterich, T.: Ensemble methods in machine learning. In: Proceedings 1st
International Workshop on Multiple Classifier Systems (MCS), pp. 1-15 (2020)

24. Dong, G., Sun, J., Wang, J., Wang, X., Dai, T.: Towards repairing neural networks
correctly. Technical report (2020). http://arxiv.org/abs/2012.01872

25. Dutta, S., Chen, X., Sankaranarayanan, S.: Reachability analysis for neural
feedback systems using regressive polynomial rule inference. In: Proceedings 22nd
ACM International Conference on Hybrid Systems: Computation and Control
(HSCC), pp. 157-168 (2019)

26. Dutta, S., Jha, S., Sankaranarayanan, S., Tiwari, A.: Learning and verification of
feedback control systems using feedforward neural networks. IFAC-PapersOnLine
51(16), 151-156 (2018)

27. Ehlers, R.: Formal verification of piece-wise linear feed-forward neural networks.
In: Proceedings 15th International Symposium on Automated Technology for
Verification and Analysis (ATVA), pp. 269-286 (2017)


28. Eliyahu, T., Kazak, Y., Katz, G., Schapira, M.: Verifying learning-augmented
systems. In: Proceedings Conference of the ACM Special Interest Group on Data
Communication on the Applications, Technologies, Architectures, and Protocols
for Computer Communication (SIGCOMM), pp. 305-318 (2021)

29. Fulton, N., Platzer, A.: Safe reinforcement learning via formal methods: toward
safe control through proof and learning. In: Proceedings 32nd AAAI Conference
on Artificial Intelligence (AAAI) (2018)

30. Ganaie, M., Hu, M., Malik, A., Tanveer, M., Suganthan, P.: Ensemble deep
learning: a review. Eng. Appl. Artif. Intell. 115, 105151 (2022)

31. Gehr, T., Mirman, M., Drachsler-Cohen, D., Tsankov, E., Chaudhuri, S., Vechev,
M.: AI2: safety and robustness certification of neural networks with abstract
interpretation. In: Proceedings 39th IEEE Symposium on Security and Privacy (S&P)
(2018)

32. Geng, C., Le, N., Xu, X., Wang, Z., Gurfinkel, A., Si, X.: Toward reliable neural
specifications. Technical report (2022). https://arxiv.org/abs/2210.16114

33. Geva, S., Sitte, J.: A cartpole experiment benchmark for trainable controllers.
IEEE Control Syst. Mag. 13(5), 40-51 (1993)

34. Goldberger, B., Adi, Y., Keshet, J., Katz, G.: Minimal modifications of deep
neural networks using verification. In: Proceedings 23rd Conference
on Logic for Programming, Artificial Intelligence and Reasoning (LPAR), pp.
260-278 (2020)

35. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)

36. Gopinath, D., Katz, G., Păsăreanu, C., Barrett, C.: DeepSafe: a data-driven
approach for assessing robustness of neural networks. In: Proceedings 16th
International Symposium on Automated Technology for Verification and Analysis
(ATVA), pp. 3-19 (2018)

37. Goubault, E., Palumby, S., Putot, S., Rustenholz, L., Sankaranarayanan, S.: Static
analysis of ReLU neural networks with tropical Polyhedra. In: Proceedings 28th
International Symposium on Static Analysis (SAS), pp. 166-190 (2021)

38. Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: off-policy
maximum entropy deep reinforcement learning with a stochastic actor. In: Proceedings
Conference on Machine Learning, pp. 1861-1870. PMLR (2018)

39. Hashemi, V., Křetínsky, J., Rieder, S., Schmidt, J.: Runtime monitoring for
out-of-distribution detection in object detection neural networks. Technical report
(2022). http://arxiv.org/abs/2212.07773

40. Huang, S., Papernot, N., Goodfellow, I., Duan, Y., Abbeel, P.: Adversarial attacks
on neural network policies. Technical report (2017).
https://arxiv.org/abs/1702.02284

41. Huang, X., Kwiatkowska, M., Wang, S., Wu, M.: Safety verification of deep
neural networks. In: Proceedings 29th International Conference on Computer Aided
Verification (CAV), pp. 3-29 (2017)

42. Isac, O., Barrett, C., Zhang, M., Katz, G.: Neural network verification with proof
production. In: Proceedings 22nd International Conference on Formal Methods
in Computer-Aided Design (FMCAD), pp. 38-48 (2022)

43. Jacoby, Y., Barrett, C., Katz, G.: Verifying recurrent neural networks using
invariant inference. In: Proceedings 18th International Symposium on Automated
Technology for Verification and Analysis (ATVA), pp. 57-74 (2020)

44. Jay, N., Rotman, N., Godfrey, B., Schapira, M., Tamar, A.: A deep reinforcement
learning perspective on internet congestion control. In: Proceedings 36th
International Conference on Machine Learning (ICML), pp. 3050-3059 (2019)




45. 


46. 


47. 


48. 


49. 


50. 


51.


52. 


53. 


54. 


55. 


56. 


57. 


58. 


59. 


60. 


61. 


62. 




Julian, K., Lopez, J., Brush, J., Owen, M., Kochenderfer, M.: Policy compression 
for aircraft collision avoidance systems. In: Proceedings 35th Digital Avionics 
Systems Conference (DASC), pp. 1-10 (2016) 

Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: an efficient 
SMT solver for verifying deep neural networks. In: Proceedings 29th International 
Conference on Computer Aided Verification (CAV), pp. 97-117 (2017) 

Katz, G., Barrett, C., Dill, D., Julian, K., Kochenderfer, M.: Reluplex: a calculus 
for reasoning about deep neural networks. Formal Methods Syst. Des. (FMSD) 
(2021) 

Katz, G., et al.: The marabou framework for verification and analysis of deep neu- 
ral networks. In: Proceedings 31st International Conference on Computer Aided 
Verification (CAV), pp. 443-452 (2019) 

Konighofer, B., Lorber, F., Jansen, N., Bloem, R.: Shield synthesis for reinforce- 
ment learning. In: Proceedings International Symposium on Leveraging Applica- 
tions of Formal Methods, Verification and Validation (ISoLA), pp. 290-306 (2020) 
Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convo- 
lutional neural networks. In: Proceedings 26th Conference on Neural Information 
Processing Systems (NeurIPS), pp. 1097-1105 (2012) 

Krogh, A., Vedelsby, J.: Neural network ensembles, cross validation, and active 
learning. In: Proceedings 7th Conference on Neural Information Processing Sys- 
tems (NeurIPS), pp. 231-238 (1994) 

Kuper, L. Katz, G., Gottschlich, J., Julian, K., Barrett, C., Kochenderfer, M.: 
Toward scalable verification for safety-critical deep networks. Technical report 
(2018). https: //arxiv.org/abs/1801.05950 

Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical 
world. Technical report (2016). http://arxiv.org/abs/1607.02533 

Lekharu, A., Moulii, K., Sur, A., Sarkar, A.: Deep learning based prediction model 
for adaptive video streaming. In: Proceedings 12th International Conference on 
Communication Systems & Networks (COMSNETS), pp. 152-159. IEEE (2020) 
Li, W., Zhou, F., Chowdhury, K.R., Meleis, W.: QTCP: adaptive congestion 
control with reinforcement learning. IEEE Trans. Netw. Sci. Eng. 6(3), 445-458 
(2018) 

Li, Y.: Deep reinforcement learning: an overview. Technical report (2017). http:// 
arxiv.org/abs/1701.07274 

Lomuscio, A., Maganti, L.: An approach to reachability analysis for feed-forward 
ReLU neural networks. Technical report (2017). http://arxiv.org/abs/1706.07351 
Loquercio, A., Segu, M., Scaramuzza, D.: A general framework for uncertainty 
estimation in deep learning. In: Proceedings International Conference on Robotics 
and Automation (ICRA), pp. 3153-3160 (2020) 

Low, S., Paganini, F., Doyle, J.: Internet congestion control. IEEE Control Syst. 
Mag. 22(1), 28-43 (2002) 

Lukina, A., Schilling, C., Henzinger, T.A.: Into the unknown: active monitoring 
of neural networks. In: Feng, L., Fisman, D. (eds.) RV 2021. LNCS, vol. 12974, 
pp. 42-61. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-88494-9_3 
Lyu, Z., Ko, C.Y., Kong, Z., Wong, N., Lin, D., Daniel, L.: Fastened crown: tight- 
ened neural network robustness certificates. In: Proceedings 34th AAAI Confer- 
ence on Artificial Intelligence (AAAI), pp. 5037-5044 (2020) 

Ma, J., Ding, S., Mei, Q.: Towards more practical adversarial attacks on graph 
neural networks. In: Proceedings 34th Conference on Neural Information Process- 
ing Systems (NeurIPS) (2020) 


63. Madry, A., Makelov, A., Schmidt, L., Tsipras, D., Vladu, A.: Towards deep
learning models resistant to adversarial attacks. Technical report (2017).
http://arxiv.org/abs/1706.06083

64. Mammadli, R., Jannesari, A., Wolf, F.: Static neural compiler optimization via
deep reinforcement learning. In: Proceedings 6th IEEE/ACM Workshop on the
LLVM Compiler Infrastructure in HPC (LLVM-HPC) and Workshop on Hierarchical
Parallelism for Exascale Computing (HiPar), pp. 1-11 (2020)

65. Mao, H., Alizadeh, M., Menache, I., Kandula, S.: Resource management with
deep reinforcement learning. In: Proceedings 15th ACM Workshop on Hot Topics
in Networks (HotNets), pp. 50-56 (2016)

66. Mao, H., Netravali, R., Alizadeh, M.: Neural adaptive video streaming with
Pensieve. In: Proceedings Conference of the ACM Special Interest Group on Data
Communication on the Applications, Technologies, Architectures, and Protocols
for Computer Communication (SIGCOMM), pp. 197-210 (2017)

67. Mnih, V., et al.: Playing Atari with deep reinforcement learning. Technical
report (2013). https://arxiv.org/abs/1312.5602

68. Moore, A.: Efficient Memory-based Learning for Robot Control. University of
Cambridge (1990)

69. Nagle, J.: Congestion control in IP/TCP internetworks. ACM SIGCOMM Comput.
Commun. Rev. 14(4), 11-17 (1984)

70. Okudono, T., Waga, M., Sekiyama, T., Hasuo, I.: Weighted automata extraction
from recurrent neural networks via regression on state spaces. In: Proceedings
34th AAAI Conference on Artificial Intelligence (AAAI), pp. 5037-5044 (2020)

71. Ortega, L., Cabañas, R., Masegosa, A.: Diversity and generalization in neural
network ensembles. In: Proceedings 25th International Conference on Artificial
Intelligence and Statistics (AISTATS), pp. 11720-11743 (2022)

72. Osband, I., Aslanides, J., Cassirer, A.: Randomized prior functions for deep
reinforcement learning. In: Proceedings 31st International Conference on Neural
Information Processing Systems (NeurIPS), pp. 8617-8629 (2018)

73. Ostrovsky, M., Barrett, C., Katz, G.: An abstraction-refinement approach to
verifying convolutional neural networks. In: Proceedings 20th International
Symposium on Automated Technology for Verification and Analysis (ATVA),
pp. 391-396 (2022)

74. Ovadia, Y., et al.: Can you trust your model’s uncertainty? Evaluating
predictive uncertainty under dataset shift. In: Proceedings 33rd Conference on
Neural Information Processing Systems (NeurIPS), pp. 14003-14014 (2019)

75. Packer, C., Gao, K., Kos, J., Krähenbühl, P., Koltun, V., Song, D.: Assessing
generalization in deep reinforcement learning. Technical report (2018).
https://arxiv.org/abs/1810.12282

76. Polgreen, E., Abboud, R., Kroening, D.: Counterexample guided neural synthesis.
Technical report (2020). https://arxiv.org/abs/2001.09245

77. Prabhakar, P.: Bisimulations for neural network reduction. In: Finkbeiner, B.,
Wies, T. (eds.) VMCAI 2022. LNCS, vol. 13182, pp. 285-300. Springer, Cham
(2022). https://doi.org/10.1007/978-3-030-94583-1_14

78. Prabhakar, P., Afzal, Z.: Abstraction based output range analysis for neural
networks. Technical report (2020). https://arxiv.org/abs/2007.09527

79. Riedmiller, M.: Neural fitted Q iteration - first experiences with a data
efficient neural reinforcement learning method. In: Gama, J., Camacho, R.,
Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI), vol. 3720,
pp. 317-328. Springer, Heidelberg (2005). https://doi.org/10.1007/11564096_32


80. Rotman, N., Schapira, M., Tamar, A.: Online safety assurance for deep
reinforcement learning. In: Proceedings 19th ACM Workshop on Hot Topics in Networks
(HotNets), pp. 88-95 (2020)

81. Ruan, W., Huang, X., Kwiatkowska, M.: Reachability analysis of deep neural
networks with provable guarantees. In: Proceedings 27th International Joint
Conference on Artificial Intelligence (IJCAI) (2018)

82. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O.: Proximal
policy optimization algorithms. Technical report (2017).
http://arxiv.org/abs/1707.06347

83. Seshia, S., et al.: Formal specification for deep neural networks. In:
Proceedings 16th International Symposium on Automated Technology for Verification
and Analysis (ATVA), pp. 20-34 (2018)

84. Silver, D., et al.: Mastering the game of go with deep neural networks and tree
search. Nature 529(7587), 484-489 (2016)

85. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. Technical report (2014). http://arxiv.org/abs/1409.1556

86. Singh, G., Gehr, T., Püschel, M., Vechev, M.: An abstract domain for certifying
neural networks. In: Proceedings 46th ACM SIGPLAN Symposium on Principles
of Programming Languages (POPL) (2019)

87. Sotoudeh, M., Thakur, A.: Correcting deep neural networks with small,
generalizing patches. In: Workshop on Safety and Robustness in Decision Making
(2019)

88. Strong, C., et al.: Global optimization of objective functions represented by
ReLU networks. J. Mach. Learn., 1-28 (2021)

89. Sun, X., Khedr, H., Shoukry, Y.: Formal verification of neural network
controlled autonomous systems. In: Proceedings 22nd ACM International Conference on
Hybrid Systems: Computation and Control (HSCC) (2019)

90. Sutton, R., Barto, A.: Reinforcement Learning: An Introduction. MIT Press
(2018)

91. Sutton, R., McAllester, D., Singh, S., Mansour, Y.: Policy gradient methods for
reinforcement learning with function approximation. In: Proceedings 12th
Conference on Neural Information Processing Systems (NeurIPS) (1999)

92. Tjeng, V., Xiao, K., Tedrake, R.: Evaluating robustness of neural networks with
mixed integer programming. Technical report (2017).
http://arxiv.org/abs/1711.07356

93. Tolstoy, L.: Anna Karenina. The Russian Messenger (1877)

94. Urban, C., Christakis, M., Wüstholz, V., Zhang, F.: Perfectly parallel fairness
certification of neural networks. In: Proceedings ACM International Conference on
Object Oriented Programming Systems Languages and Applications (OOPSLA),
pp. 1-30 (2020)

95. Usman, M., Gopinath, D., Sun, Y., Noller, Y., Pasareanu, C.: NNrepair:
constraint-based repair of neural network classifiers. Technical report (2021).
http://arxiv.org/abs/2103.12535

96. Valadarsky, A., Schapira, M., Shahaf, D., Tamar, A.: Learning to route with deep
RL. In: NeurIPS Deep Reinforcement Learning Symposium (2017)

97. van Hasselt, H., Guez, A., Silver, D.: Deep reinforcement learning with double
Q-learning. In: Proceedings 30th AAAI Conference on Artificial Intelligence (AAAI)
(2016)

98. Vasić, M., Petrovic, A., Wang, K., Nikolić, M., Singh, R., Khurshid, S.: MoËT:
mixture of expert trees and its application to verifiable reinforcement learning.
Neural Netw. 151, 34-47 (2022)


99. Wang, S., Pei, K., Whitehouse, J., Yang, J., Jana, S.: Formal security analysis
of neural networks using symbolic intervals. In: Proceedings 27th USENIX Security
Symposium, pp. 1599-1614 (2018)

100. Wu, H., et al.: Parallelization techniques for verifying neural networks. In:
Proceedings 20th International Conference on Formal Methods in Computer-Aided
Design (FMCAD), pp. 128-137 (2020)

101. Wu, H., Zeljić, A., Katz, G., Barrett, C.: Efficient neural network analysis
with sum-of-infeasibilities. In: Proceedings 28th International Conference on Tools
and Algorithms for the Construction and Analysis of Systems (TACAS), pp. 143-163
(2022)

102. Xiang, W., Tran, H., Johnson, T.: Output reachable set estimation and
verification for multi-layer neural networks. IEEE Trans. Neural Netw. Learn. Syst.
(TNNLS) (2018)

103. Yang, J., Zeng, X., Zhong, S., Wu, S.: Effective neural network ensemble
approach for improving generalization performance. IEEE Trans. Neural Netw. Learn.
Syst. (TNNLS) 24(6), 878-887 (2013)

104. Yang, X., Yamaguchi, T., Tran, H., Hoxha, B., Johnson, T., Prokhorov, D.:
Neural network repair with reachability analysis. In: Proceedings 20th International
Conference on Formal Modeling and Analysis of Timed Systems (FORMATS),
pp. 221-236 (2022)

105. Zelazny, T., Wu, H., Barrett, C., Katz, G.: On reducing over-approximation
errors for neural network verification. In: Proceedings 22nd International
Conference on Formal Methods in Computer-Aided Design (FMCAD), pp. 17-26 (2022)

106. Zhang, H., Shinn, M., Gupta, A., Gurfinkel, A., Le, N., Narodytska, N.:
Verification of recurrent neural networks for cognitive tasks via reachability
analysis. In: Proceedings 24th European Conference on Artificial Intelligence
(ECAI), pp. 1690-1697 (2020)

107. Zhang, J., Kim, J., O’Donoghue, B., Boyd, S.: Sample efficient reinforcement
learning with REINFORCE. Technical report (2020).
https://arxiv.org/abs/2010.11364

108. Zhang, J., et al.: An end-to-end automatic cloud database tuning system using
deep reinforcement learning. In: Proceedings of the 2019 International Conference
on Management of Data (SIGMOD), pp. 415-432 (2019)


Open Access This chapter is licensed under the terms of the Creative Commons 
Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), 
which permits use, sharing, adaptation, distribution and reproduction in any medium 
or format, as long as you give appropriate credit to the original author(s) and the 
source, provide a link to the Creative Commons license and indicate if changes were 
made. 


The images or other third party material in this chapter are included in the
chapter’s Creative Commons license, unless indicated otherwise in a credit line to the
material. If material is not included in the chapter’s Creative Commons license and
your intended use is not permitted by statutory regulation or exceeds the permitted
use, you will need to obtain permission directly from the copyright holder.


Author Index


A
Abdulla, Parosh Aziz
Akshay, S.
Albert, Elvira
Alistarh, Dan
Alur, Rajeev
Amilon, Jesper
Amir, Guy
An, Jie
Anand, Ashwani
Andriushchenko, Roman
Apicelli, Andrew
Arcaini, Paolo
Asada, Kazuyuki
Ascari, Flavio
Atig, Mohamed Faouzi

B
Badings, Thom
Barrett, Clark
Bastani, Favyen
Bastani, Osbert
Bayless, Sam
Becchi, Anna
Beutner, Raven
Bisping, Benjamin
Blicha, Martin
Bonchi, Filippo
Bork, Alexander
Braught, Katherine
Britikov, Konstantin
Brown, Fraser
Bruni, Roberto
Bucev, Mario

C
Calinescu, Radu
Češka, Milan
Chakraborty, Supratik
Chatterjee, Krishnendu
Chaudhuri, Swarat
Chechik, Marsha
Chen, Hanyue
Chen, Taolue
Chen, Yu-Fang
Choi, Sung Woo
Chung, Kai-Min
Cimatti, Alessandro
Cosler, Matthias
Couillard, Eszter
Czerner, Philipp

D
Dardik, Ian
Das, Ankush
David, Cristina
Dongol, Brijesh
Dreossi, Tommaso
Dutertre, Bruno

E
Eberhart, Clovis
Esen, Zafer
Esparza, Javier

F
Farzan, Azadeh
Fedorov, Alexander
Feng, Nick
Finkbeiner, Bernd
Fremont, Daniel J.
Frenkel, Hadar
Fu, Hongfei
Fu, Yu-Fu

G
Gacek, Andrew
Garcia-Contreras, Isabel
Gastin, Paul
Genaim, Samir
Getir Yaman, Sinem
Ghosh, Shromona
Godbole, Adwait
Goel, Amit
Goharshady, Amir Kafshdar
Goldberg, Eugene
Gopinath, Divya
Gori, Roberta
Govind, R.
Govind, V. K. Hari
Griggio, Alberto
Guilloud, Simon
Gurfinkel, Arie
Gurov, Dilian

H
Hahn, Christopher
Hasuo, Ichiro
Henzinger, Thomas A.
Hofman, Piotr
Hovland, Paul D.
Hückelheim, Jan

I
Imrie, Calum

J
Jaganathan, Dhiva
Jain, Sahil
Jansen, Nils
Jez, Artur
Johannsen, Chris
Johnson, Taylor T.
Jonas, Martin
Jones, Phillip
Joshi, Aniruddha R.
Jothimurugan, Kishor
Junges, Sebastian

K
Kang, Eunsuk
Karimi, Mahyar
Kashiwa, Shun
Katoen, Joost-Pieter
Katz, Guy
Kempa, Brian
Kiesl-Reiter, Benjamin
Kim, Edward
Kirchner, Daniel
Kokologiannakis, Michalis
Kong, Soonho
Kori, Mayuko
Koval, Nikita
Kremer, Gereon
Křetínský, Jan
Krishna, Shankaranarayanan
Kueffner, Konstantin
Kunčak, Viktor

L
Lafortune, Stéphane
Lahav, Ori
Lengál, Ondřej
Lette, Danya
Li, Elaine
Li, Haokun
Li, Jianwen
Li, Yangge
Li, Yannan
Lidström, Christian
Lin, Anthony W.
Lin, Jyun-Ao
Liu, Jiaxiang
Liu, Mingyang
Liu, Zhiming
Lopez, Diego Manzanas
Lotz, Kevin
Luo, Ziqing

M
Maayan, Osher
Macák, Filip
Majumdar, Rupak
Mallik, Kaushik
Mangal, Ravi
Marandi, Ahmadreza
Markgraf, Oliver
Marmanis, Iason
Marsso, Lina
Martin-Martin, Enrique
Mazowiecki, Filip
Meel, Kuldeep S.
Meggendorfer, Tobias
Meira-Góes, Rômulo
Mell, Stephen
Mendoza, Daniel
Metzger, Niklas
Meyer, Roland
Mi, Junri
Milovančević, Dragana
Mitra, Sayan

N
Nagarakatte, Santosh
Narayana, Srinivas
Nayak, Satya Prakash
Niemetz, Aina
Nowotka, Dirk

O
Offtermatt, Philip
Opaterny, Anton
Ozdemir, Alex

P
Padhi, Saswat
Păsăreanu, Corina S.
Peng, Chao
Perez, Mateo
Preiner, Mathias
Prokop, Maximilian
Pu, Geguang

R
Reps, Thomas
Rhea, Matthew
Rieder, Sabine
Rodríguez, Andoni
Roy, Subhajit
Rozier, Kristin Yvonne
Rümmer, Philipp
Rychlicki, Mateusz

S
Sabetzadeh, Mehrdad
Sanchez, César
Sangiovanni-Vincentelli, Alberto L.
Schapira, Michael
Schmitt, Frederik
Schmuck, Anne-Kathrin
Seshia, Sanjit A.
Shachnai, Matan
Sharma, Vaibhav
Sharygina, Natasha
Shen, Keyi
Shi, Xiaomu
Shoham, Sharon
Siegel, Stephen F.
Sistla, Meghana
Sokolova, Maria
Somenzi, Fabio
Song, Fu
Soudjani, Sadegh
Srivathsan, B.
Stanford, Caleb
Stutz, Felix
Su, Yu
Sun, Jun
Sun, Yican

T
Takhar, Gourav
Tang, Xiaochao
Tinelli, Cesare
Topcu, Ufuk
Tran, Hoang-Dung
Tripakis, Stavros
Trippel, Caroline
Trivedi, Ashutosh
Tsai, Ming-Hsien
Tsai, Wei-Lun
Tsitelov, Dmitry

V
Vafeiadis, Viktor
Vahanwala, Mihir
Veanes, Margus
Vin, Eric
Vishwanathan, Harishankar

W
Waga, Masaki
Wahby, Riad S.
Wang, Bow-Yaw
Wang, Chao
Wang, Jingbo
Wang, Meng
Watanabe, Kazuki
Wehrheim, Heike
Whalen, Michael W.
Wies, Thomas
Wolff, Sebastian
Wu, Wenhao

X
Xia, Bican
Xia, Yechuan

Y
Yadav, Raveesh
Yang, Bo-Yin
Yang, Jiong
Yang, Zhengfeng
Yu, Huafeng
Yu, Yijun
Yue, Xiangyu

Z
Zdancewic, Steve
Zelazny, Tom
Zeng, Xia
Zeng, Zhenbing
Zhang, Hanliang
Zhang, Li
Zhang, Miaomiao
Zhang, Pei
Zhang, Yedi
Zhang, Zhenya
Zhao, Tianqi
Zhu, Haoqing
Žikelić, Đorđe
Zufferey, Damien


© The Editor(s) (if applicable) and The Author(s) 2023
C. Enea and A. Lal (Eds.): CAV 2023, LNCS 13965, pp. 457-460, 2023.
https://doi.org/10.1007/978-3-031-37703-7


